Preface



This volume contains the final versions of papers originally given at the workshop 
Vision Algorithms: Theory and Practice, which was held on 21-22 September 1999
during the Seventh International Conference on Computer Vision at the Corfu Holiday 
Palace Hotel in Kanoni, Corfu, Greece. 

The subject of the workshop was algorithmic issues in computer vision, and espe- 
cially in vision geometry: correspondence, tracking, structure and motion, and image 
synthesis. Both theoretical and practical aspects were considered. A particular goal was 
to take stock of the ‘new wave’ of geometric and statistical techniques that have been 
developed over the last few years, and to ask which of these are proving useful in real 
applications. To encourage discussion, we asked the presenters to stand back from their 
work and reflect on its context and longer term prospects, and we encouraged the audi- 
ence to actively contribute questions and comments. The current volume retains some of 
the flavour of this, as each paper is followed by a brief edited transcript of the discussion 
that followed its presentation. 

The theme was certainly topical, as we had 65 submitted papers for only 15 places (an 
acceptance rate of only 23%), and around 100 registered participants in all (nearly 1/3 of 
the ICCV registration). With so many submissions, there were some difficult decisions to 
make, and our reviewers deserve many thanks for their thoroughness and sound judgment 
in paper evaluation. As several authors commented, the overall quality of the reviews was 
exceptionally high. The accepted papers span the full range of algorithms for geometric 
vision, and we think that their quality will speak for itself. 

To complement the submitted papers, we commissioned two invited talks “from the 
shop floor”, two “expert reviews” on topical technical issues, and a panel session. 

The invited talks were by two industry leaders with a great deal of experience in 
building successful commercial vision systems: 

- Keith Hanna of the Sarnoff Corporation described Sarnoff’s real time video align- 
ment and annotation systems, which are used routinely in applications ranging from 
military reconnaissance to inserting advertisements and annotations on the Super 
Bowl field. This work is presented in the paper Annotation of Video by Alignment
to Reference Imagery on page 253.

- Luc Robert of REALViZ S.A. described REALViZ’s MatchMover and ReTimer 
post-production systems for movie special effects, which are used in a number of 
large post-production houses. Unfortunately there is no paper for this presentation, 
but the discussion that followed it is summarized on page 265. 

Both presenters tried to give us some of the fruits of their experience in the difficult art 
of “making it work”, illustrated by examples from their own systems. 

The two “expert reviews” were something of an experiment. Each was a focused 
technical summary prepared jointly by a small team of people that we consider to be 
domain experts. In each case, the aim was to provide a concise technical update and state 
of the art, and then to discuss the advantages of the various implementation choices in
a little more depth. 

The motivation for these review sessions was as follows. As active members of the 
vision community and referees of many papers, we continually find that certain basic 
topics are poorly understood. This applies particularly to areas where a cultural split has 
occurred, with two or more camps following more or less separate lines of development. 
There are several such splits in the vision community, and we feel that every effort must 
be made to heal them. For one thing, it is fruitless for one group to reduplicate the 
successes and failures of another, or to continue with a line of research that others know 
to be unprofitable. More positively, intercommunication breeds innovation, and it is often 
at the boundaries between fields that the most rapid progress is made. The workshop 
as a whole was intended to take stock of the rapid progress made in vision geometry 
over the past decade, and hopefully to narrow the gap between “the geometers” and “the 
rest”. Within this scope, we singled out the following two areas for special treatment: (i)
the choice between direct and feature-based correspondence methods; and (ii) bundle
adjustment.

Direct versus feature-based correspondence methods: One of the significant splits 
that has emerged in the vision community over the past 15-20 years is in the analysis of 
image sequences and multi-view image sets. Two classes of techniques are used: 

- “Feature-based” approaches: Here, the problem is broken down into three stages:
(i) local geometric features are extracted from each image (e.g. “points of interest”,
linear edges, ...); (ii) these features are used to compute multi-view relations, such
as the epipolar geometry, and simultaneously are put into correspondence with one
another using a robust search method; (iii) the estimated multi-view relations and
correspondences are used for further computations such as refined correspondences,
3D structure recovery, plane recovery and alignment, moving object detection, etc.

- “Direct” approaches: Here, rather than extracting isolated features, dense spatio- 
temporal variations of image brightness (or color, texture, or some other dense 
descriptor) are used directly. Instead of a combinatorial search over feature corre- 
spondences, there is a search over the continuous parameters of an image motion 
model (translation, 2D affine, homographic), that in principle establishes dense cor- 
respondences as well as motion parameters. Often, a multi-scale search is used. 

The experts in this session were P. Anandan & Michal Irani, who present the direct 
approach in the paper About Direct Methods on page 267, and Phil Torr & Andrew 
Zisserman, who present the feature-based approach in the paper Feature Based Methods 
for Structure and Motion Estimation on page 278. In each case, the authors try: (i) to give
a brief, clear description of the two classes of methods; (ii) to identify the applications
in which each has been most successful; and (iii) to discuss the limitations of each
approach. The discussion that followed the session is summarized on page 295.

Bundle adjustment for visual reconstruction: Bundle adjustment is the refinement 
of visual reconstructions by simultaneous optimization over both structure and camera 
parameters. It was initially developed in the late 1950’s and 1960’s in the aerial pho- 
togrammetry community, where already by 1970 extremely accurate reconstruction of 
networks of thousands of images was feasible. The computer vision community is only
now starting to consider problems of this size, and is still largely ignorant of the theory 
and methods of bundle adjustment. In part this is because cultural differences make the 
photogrammetry literature relatively inaccessible to most vision researchers, so one aim 
of this session was to present the basic photogrammetric techniques from a computer 
vision perspective. The issues raised in the session are reported in the survey paper Bun- 
dle Adjustment — A Modern Synthesis on page 298. This paper is rather long, but we 
publish it in the hope that it will be useful to the community to have the main elements 
of the theory collected in one place. 

The workshop ended with an open panel session, with Richard Hartley, P. Anandan, 
Jitendra Malik, Joe Mundy and Olivier Faugeras as panelists. Each panelist selected a 
topic related to the workshop theme that he felt was important, and gave a short position 
statement on it followed by questions and discussion. The panel finished with more 
general discussion. A brief summary of the discussion and the issues raised by the panel 
is given on page 376. 

Finally, we would like to thank the many people who helped to organize the workshop, 
and without whom it would not have been possible. The scientific helpers are listed on the 
following pages, but thanks must also go to: John Tsotsos, the chairman of ICCV’99, 
for his help with the logistics and above all for hosting a great main conference; to 
Mary-Kate Rada and Maggie Johnson of the IEEE Computer Society, and to Daniele 
Herzog of INRIA for their efficient organizational support; to the staff of the Corfu 
Holiday Palace for some memorable catering; and to INRIA Rhône-Alpes and the IEEE
Computer Society for agreeing to act as sponsors. 
Abstract. While many algorithms for computing stereo correspondence
have been proposed, there has been very little work on experimentally 
evaluating algorithm performance, especially using real (rather than syn- 
thetic) imagery. In this paper we propose an experimental compari- 
son of several different stereo algorithms. We use real imagery, and ex- 
plore two different methodologies, with different strengths and weak- 
nesses. Our first methodology is based upon manual computation of 
dense ground truth. Here we make use of two stereo pairs: one of
these, from the University of Tsukuba, contains mostly fronto-parallel 
surfaces; while the other, which we built, is a simple scene with a slanted 
surface. Our second methodology uses the notion of prediction error, 
which is the ability of a disparity map to predict an (unseen) third 
image, taken from a known camera position with respect to the in- 
put pair. We present results for both correlation-style stereo algorithms 
and techniques based on global methods such as energy minimization. 
Our experiments suggest that the two methodologies give qualitatively 
consistent results. Source images and additional materials, such as the 
implementations of various algorithms, are available on the web from 
http://www.research.microsoft.com/~szeliski/stereo.



1 Introduction 

The accurate computation of stereo depth is an important problem in early 
vision, and is vital for many visual tasks. A large number of algorithms have 
been proposed in the literature (see [8, 11] for literature surveys). However, the
state of the art in evaluating stereo methods is quite poor. Most papers do not 
provide quantitative comparisons of their methods with previous approaches. 
When such comparisons are done, they are almost inevitably restricted to synthetic
imagery. (However, see [13] for a case where real imagery was used to
compare a hierarchical area-based method with a hierarchical scanline matching 
algorithm.) 

The goal of this paper is to rectify this situation, by providing a quantitative 
experimental methodology and comparison among a variety of different methods 
using real imagery. There are a number of reasons why such a comparison is valu- 
able. Obviously, it allows us to measure progress in our field and motivates us to 
develop better algorithms. It allows us to carefully analyze algorithm characteris- 
tics and to improve overall performance by focusing on sub-components. It allows 
us to ensure that algorithm performance is not unduly sensitive to the setting
of “magic parameters”. Furthermore, it enables us to design or tailor algorithms
for specific applications, by tuning these algorithms to problem-dependent cost 
or fidelity metrics and to sample data sets. 

We are particularly interested in using these experiments to obtain a deeper 
understanding of the behavior of various algorithms. To that end, we focus on 
image phenomena that are well-known to cause difficulty for stereo algorithms, 
such as depth discontinuities and low-texture regions. 

Our work can be viewed as an attempt to do for stereo what Barron et al.’s 
comparative analysis of motion algorithms accomplished for motion. The motion
community benefited significantly from that paper; many subsequent papers
have made use of these sequences. However, Barron et al. rely exclusively on 
synthetic data for their numerical comparisons; even the well-known “Yosemite” 
sequence is computer-generated. As a consequence, it is unclear how well their 
results apply to real imagery. 

This paper is organized as follows. We begin by describing our two evaluation 
methodologies and the imagery we used. In section 3 we describe the stereo
algorithms that we compare and give some implementation details. Section 4
gives experimental results from our investigations. We close with a discussion of 
some extensions that we are currently investigating. 

2 Evaluation Methodologies 

We are currently studying and comparing two different evaluation methodolo- 
gies: comparison with ground truth depth maps, and the measurement of novel 
view prediction errors. 

2.1 Data Sets 

The primary data set that we used is a multi-image stereo set from the University 
of Tsukuba, where every pixel in the central reference image has been labeled 
by hand with its correct disparity. The image we use for stereo matching and 
the ground truth depth map are shown in figure 1. Note that the scene is fairly
fronto-planar, and that the ground truth contains a small number of integer- 
valued disparities. 

The most important limitation of the Tsukuba imagery is the lack of slanted 
surfaces. We therefore created a simple scene containing a slanted surface. The 
scene, together with the ground truth, is shown in figure 2. The objects in
the scene are covered with paper that has fairly high-texture pictures on it. 
In addition, the scene geometry is quite simple. Additional details about this 
imagery can be found at the web site for this paper, 
http://www.research.microsoft.com/~szeliski/stereo.

2.2 Comparison with Ground Truth 

The ground truth images are smaller than the input images; we handle this by 
ignoring the borders (i.e., we only compute error statistics at pixels which are 
Fig. 1. Imagery from the University of Tsukuba (panels: Discontinuities, Smooth regions).



given a label in the ground truth). Discarding borders is particularly helpful for
correlation-based methods, since their output near the border is not well-defined. 
The interesting regions in the Tsukuba imagery include: 

— Specular surfaces, including the gray shelves in the upper left of the image, 
the orange lamp, and the white statue of a face. Specularities cause difficulty 
in computing depth, due to the reflected motion of the light source. 

— Textureless regions, including the wall at the top right corner and the deeply 
shadowed area beneath the table. Textureless regions are locally ambiguous, 
which is a challenge for stereo algorithms. 

— Depth discontinuities, at the borders of all the objects. It is difficult to 
compute depth at discontinuities, for a variety of reasons. It is especially 
difficult for thin objects, such as the orange lamp handle. 

— Occluded pixels, near some of the object borders. Ideally, a stereo algorithm 
should detect and report occlusions; in practice, many algorithms do not do 
this, and in fact tend to give incorrect answers at unoccluded pixels near the 
occlusions. 

Our goal is to analyze the effectiveness of different methods in these differ- 
ent regions. We have used the ground truth to determine the depth disconti- 
nuities and the occluded pixels. A pixel is a depth discontinuity if any of its 
Fig. 2. Imagery from Microsoft Research (panels: Image, Ground truth).



(4-connected) neighbors has a disparity that differs by more than 1 from its
disparity.¹ A pixel is occluded if, according to the ground truth, it is not visible in
both images.

While the Tsukuba imagery contains textureless regions, there is no natural 
way to determine these from the ground truth. Instead, we looked for large 
regions where the image gradient was small. We found five such regions, which 
are shown in figure 1. These regions correspond to major features in the scene;
for example, one is the lamp shade, while two come from shadowed regions under 
the table. 
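
The search for large low-gradient regions can be sketched as follows. This is only one hypothetical way to automate the step: the text does not say how the five regions were actually found, and the function name `low_texture_regions`, the gradient measure (sum of absolute central differences), the threshold `grad_thresh` and the size cutoff `min_size` are all assumptions made for illustration.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[float]]


def low_texture_regions(image: Image, grad_thresh: float,
                        min_size: int) -> List[List[Tuple[int, int]]]:
    """Return 4-connected regions of low-gradient pixels with at least min_size pixels."""
    h, w = len(image), len(image[0])

    def gradient(y: int, x: int) -> float:
        # Central differences, clamped at the image border (assumed measure).
        gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
        gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
        return abs(gx) + abs(gy)

    flat = [[gradient(y, x) <= grad_thresh for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not flat[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over low-gradient pixels.
            queue, region = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and flat[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_size:
                regions.append(region)
    return regions
```

Both thresholds would need tuning per image; small values of `min_size` would admit many tiny flat patches rather than the handful of major scene features described above.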

We have computed error statistics for the depth discontinuities and for the 
textureless regions separately from our statistics for the other pixels. Since the 
methods we wish to compare include algorithms that do not detect occlusions, we 
have ignored the occluded pixels for the purpose of our statistics. Our statistics 
count the number of pixels whose disparity differs from the ground truth by more 
than ±1. This makes sense because the true disparities are usually fractional. 
Therefore, having an estimate which differs from the true value by a tiny amount 
could be counted as an error if we required exact matches. In fact, we also 
computed exact matches, and obtained quite similar overall results. 
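
The statistics described in this section can be sketched as below. This is a minimal illustration, not the code used for the experiments: disparity maps are assumed to be nested lists, `None` is assumed to mark pixels that are unlabeled or occluded in the ground truth, and the names `discontinuity_mask` and `error_rate` are hypothetical.

```python
from typing import List, Optional

Disparity = List[List[Optional[int]]]  # None: unlabeled or occluded (assumed encoding)


def discontinuity_mask(gt: Disparity) -> List[List[bool]]:
    """Mark pixels with a 4-connected neighbor whose disparity differs by more than 1."""
    h, w = len(gt), len(gt[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if gt[y][x] is None:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and gt[ny][nx] is not None:
                    if abs(gt[ny][nx] - gt[y][x]) > 1:
                        mask[y][x] = True
    return mask


def error_rate(est, gt: Disparity, region=None, tol: int = 1) -> float:
    """Fraction of labeled pixels (optionally restricted to region) with |est - gt| > tol."""
    bad = total = 0
    for y, row in enumerate(gt):
        for x, d in enumerate(row):
            if d is None or (region is not None and not region[y][x]):
                continue  # ignore occluded, unlabeled, or out-of-region pixels
            total += 1
            if abs(est[y][x] - d) > tol:
                bad += 1
    return bad / total if total else 0.0
```

Calling `error_rate(est, gt, region=discontinuity_mask(gt))` gives the discontinuity-only statistic, and `tol=0` gives the exact-match variant mentioned above.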



2.3 Comparison Using Prediction Error 

An alternative approach to measuring the quality of a stereo algorithm is to 
test how well it predicts novel views.² This is a particularly appropriate test
when the stereo results are to be used for image-based rendering, but it is also 
useful for other tasks such as motion-compensated prediction in video coding and 
frame rate conversion. This approach parallels methodologies that are prevalent 

¹ Neighboring pixels that are part of a sloped surface can easily differ by 1 pixel, but
should not be counted as discontinuities. 

² A third possibility is to run the stereo algorithm on several different pairs within a
multi-image data set, and to compute the self-consistency between different disparity
estimates. We will discuss this option in more detail in section 5.
in many other research areas, such as speech recognition, machine learning, and
information retrieval. In these disciplines, it is traditional to partition data into 
training data that is used to tune up the system, and test data that is used 
to evaluate it. In statistics, it is common to leave some data out during model 
fitting to prevent over-fitting (cross-validation). 

To apply this methodology to stereo matching, we compute the depth map 
using a pair of images selected from a larger, multi-image stereo dataset, and 
then measure how well the original reference image plus depth map predicts the 
remaining views. The University of Tsukuba data set is an example of such a 
multi-image data set: the two images in figure 1 are part of a 5 x 5 grid of views
of this scene. Many other examples of multi-image data sets exist, including the 
Yosemite fly-by, the SRI Trees set, the NASA (Coke can) set, the MPEG-4 
flower garden data set, and many of the data sets in the CMU Image Database 
(http://www.ius.cs.cmu.edu/idb). Most of these data sets do not have an 
associated ground truth depth map, and yet all of them can be used to evaluate 
stereo algorithms if prediction error is used as a metric. 

When developing a prediction error metric, we must specify two different 
components: an algorithm for predicting novel views, and a metric for determin- 
ing how well the actual and predicted images match. A more detailed description 
of these issues can be found in our framework paper, which lays the foundations 
for prediction error as a quality metric.

In terms of view prediction, we have a choice of two basic algorithms. We can 
generate novel views from a color/depth image using forward warping, which
involves moving the source pixels to the destination image and potentially filling 
in gaps. Alternatively, we can use inverse warping to pull pixels from the new 
(unseen) views back into the coordinate frame of the original reference image. 
This is easier to implement, since no decision has to be made as to which gaps 
need to be filled. 

Unfortunately, inverse warping will produce erroneous results at pixels which 
are occluded (invisible) in the novel views, unless a separate occlusion (or vis- 
ibility) map is computed for each novel view. Without this occlusion masking, 
certain stereo algorithms actually outperform the ground truth in terms of pre- 
diction error, since they try to match occluded pixels to some other pixel of the 
same color. In our experiments, therefore, we do not include occluded pixels in 
the computation of prediction error. 

The simplest error metric that can be used is an L2 (root mean square)
distance between the pixel values. It is also possible to use a robust measure,
which downweights large errors, and to count the number of outliers. Another
possibility is to compute the per-pixel residual motion between the predicted and
real image, and to compensate one of the two images by this motion to obtain a
compensated RMS or robust error measure. For simplicity, we use the raw
(uncompensated and un-robustified) RMS error.
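
The combination of inverse warping, occlusion masking and raw RMS error can be sketched as follows. This is a simplified illustration under stated assumptions: the views are rectified and differ only by a horizontal camera shift, a reference pixel at column x with disparity d is assumed to appear at column x + shift * d in the novel view (with shift = 1 corresponding to the second input image), and the names `sample_row` and `prediction_rms` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def sample_row(row: Sequence[float], x: float) -> Optional[float]:
    """Linearly interpolate row at fractional column x; None if x is outside the image."""
    if x < 0 or x > len(row) - 1:
        return None
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    return (1 - t) * row[x0] + t * row[x1]


def prediction_rms(ref: List[List[float]], novel: List[List[float]],
                   disp: List[List[float]], shift: float,
                   visible: Optional[List[List[bool]]] = None) -> float:
    """RMS difference between ref and the novel view pulled back into the reference frame."""
    sq_sum, count = 0.0, 0
    for y, ref_row in enumerate(ref):
        for x, value in enumerate(ref_row):
            if visible is not None and not visible[y][x]:
                continue  # pixel occluded in the novel view: excluded from the error
            pulled = sample_row(novel[y], x + shift * disp[y][x])
            if pulled is None:
                continue  # the pixel maps outside the novel image
            sq_sum += (value - pulled) ** 2
            count += 1
    return math.sqrt(sq_sum / count) if count else 0.0
```

Because every reference pixel pulls its own value from the novel view, no gap filling is needed, which is the implementation advantage of inverse warping noted above. The `visible` mask has to be supplied separately for each novel view.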
3 Algorithms

Loosely speaking, methods for computing dense stereo depth can be divided
into two classes.³ The first class of methods allows every pixel to independently
select its disparity, typically by analyzing the intensities in a fixed rectangular
window. These methods use statistical measures to compare the two windows,
and are usually based on correlation. The second class relies on global
methods, and typically finds the depth map that minimizes some function, called
the energy or the objective function. These methods generally use an iterative
optimization technique, such as simulated annealing.

3.1 Local Methods Based on Correlation 

We implemented a number of standard correlation-based methods that use fixed-size
square windows. We define the radius of a square whose side length is 2r + 1
to be r. The methods we chose were:

— Correlation using the L2 and L1 distance. The L2 distance is the standard
correlation-based method, while the L1 distance is more robust.

— Robust correlation using M-estimation with a truncated quadratic.

— Robust correlation using Least Median Squares.
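
A minimal sketch of fixed-window matching with interchangeable per-pixel penalties is given below. It illustrates the general technique rather than the implementations evaluated here. The sign convention (a reference pixel at column x matches column x + d in the other image), the winner-take-all selection, and the names `window_cost`, `match_pixel` and `truncated_quadratic` are assumptions; least median of squares needs a median over the window rather than a sum and is not covered.

```python
from typing import Callable, List

Image = List[List[float]]
Penalty = Callable[[float], float]


def l2(e: float) -> float:
    return e * e


def l1(e: float) -> float:
    return abs(e)


def truncated_quadratic(limit: float) -> Penalty:
    """Quadratic penalty capped at limit**2, a simple M-estimator."""
    return lambda e: min(e * e, limit * limit)


def window_cost(ref: Image, other: Image, x: int, y: int, d: int,
                r: int, penalty: Penalty) -> float:
    """Sum penalty(difference) over the (2r+1) x (2r+1) window centred at (x, y)."""
    total = 0.0
    for yy in range(y - r, y + r + 1):
        for xx in range(x - r, x + r + 1):
            total += penalty(ref[yy][xx] - other[yy][xx + d])
    return total


def match_pixel(ref: Image, other: Image, x: int, y: int, r: int,
                max_disp: int, penalty: Penalty) -> int:
    """Winner-take-all: return the disparity in 0..max_disp with the lowest window cost.

    The caller must keep the window inside ref (r <= x < width - r, same for y).
    """
    width = len(other[0])
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        if x + r + d >= width:
            break  # the shifted window would leave the other image
        cost = window_cost(ref, other, x, y, d, r, penalty)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Each pixel is matched independently of its neighbors, which is what distinguishes these local methods from the global methods of the next section.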

3.2 Global Methods 

Most global methods are based on energy minimization, so the major variable
is the choice of energy function. Some stereo methods minimize a 1-dimensional
energy function independently along each scanline. This minimization
can be done efficiently via dynamic programming. More recent work has
enforced some consistency between adjacent scanlines. We have found that one of
these methods, MLMHV, performs quite well in practice, so we have included
it in our study.

The most natural energy functions for stereo are two-dimensional, and contain
a data term and a smoothness term. The data term is typically of the form
$\sum_p \left[ I(p) - I'(p + d(p)) \right]^2$, where $d$ is the depth map, $p$ ranges over pixels, and $I$
and $I'$ are the input images. For our initial experiments, we have chosen a simple
smoothness term which behaves well for largely fronto-planar imagery (such as
that shown in figure 1). This energy function is the Potts energy, and is simply
the number of adjacent pixels with different disparities.
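
As a concrete reading of this energy, the sketch below evaluates a squared-difference data term plus a Potts smoothness term for a given integer disparity map. Several details are assumptions not stated in the text: the displacement is taken to be horizontal (column x matched to column x + d), matches that fall outside the image are skipped in the data term, adjacency is 4-connected, and the relative weight `lam` and the name `potts_energy` are hypothetical.

```python
from typing import List


def potts_energy(ref: List[List[float]], other: List[List[float]],
                 disp: List[List[int]], lam: float = 1.0) -> float:
    """Data term sum_p [I(p) - I'(p + d(p))]^2 plus lam times the Potts smoothness count."""
    h, w = len(ref), len(ref[0])
    data = 0.0
    smooth = 0
    for y in range(h):
        for x in range(w):
            xm = x + disp[y][x]
            if 0 <= xm < w:  # skip matches outside the image (assumed handling)
                data += (ref[y][x] - other[y][xm]) ** 2
            # Count each 4-connected neighbor pair once: right and down.
            if x + 1 < w and disp[y][x] != disp[y][x + 1]:
                smooth += 1
            if y + 1 < h and disp[y][x] != disp[y + 1][x]:
                smooth += 1
    return data + lam * smooth
```

The smoothness count is the same for a disparity jump of 1 or of 10, so the Potts term favors piecewise-constant maps, which suits largely fronto-planar scenes.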

In the energy minimization framework, it is difficult to determine whether an 
algorithm fails due to the choice of energy function or due to the optimization 
method. This is especially true because minimizing the energy functions that 
arise in early vision is almost inevitably NP-hard. By selecting a single
energy function for our initial experiments, we can control for this variable. 

³ There are a number of stereo methods that compute a sparse depth map. We do not
consider these methods for two reasons. First, a dense output is required for a number 
of interesting applications, such as view synthesis. Second, a sparse depth map makes 
it difficult to identify statistically significant differences between algorithms. 
Fig. 3. Results for several algorithms, on the imagery of figure 1.



We used three methods to minimize the energy: 

— Simulated annealing is the most common energy minimization technique.
Following earlier work, we experimented with several different annealing
schedules; our data is from the one that performed best.

— Graph cuts are a combinatorial optimization technique that can be used to
minimize a number of different energy functions [7]. Other algorithms based
on graph cuts have also been proposed.

— Mean field methods replace the stochastic update rules of simulated annealing
with deterministic rules based either on the behavior of the mean or
mode disparity at each pixel, or the local distribution of probabilities
across disparity. We present results for the latter algorithm.

The final method we experimented with is a global method that is not based 
on energy minimization. This algorithm, due to Zitnick and Kanade [23], is a
cooperative method in the style of the Marr-Poggio algorithm. A particularly in- 
teresting feature of the algorithm is that it enforces various physical constraints, 
such as uniqueness. The uniqueness constraint states that a non-occluded pixel 
in one image should map to a unique pixel in the other image. 
Fig. 4. Errors on ground truth data, from the results shown in figure 3.



4 Experimental Results 

We have run all the mentioned algorithms on the Tsukuba imagery, and used 
both ground truth and prediction error to analyze the results. In addition, we 
have run the correlation-based algorithms on the Microsoft Research imagery. 

4.1 Results on the Tsukuba Imagery 

Figure 3 shows the depth maps computed by three different algorithms. Figures
5, 6 and 7 show the performance of various algorithms using the ground truth
performance methodology. Figure 5 shows the performance of correlation-based
methods as a function of window size. Figure 6 shows the performance of two
global optimization methods.

Figure 7 summarizes the overall performance of the best versions of different
methods. The graph cuts algorithm has the best performance, and makes no
errors in the low-textured areas shown in figure 1. Simulated annealing,
mean-field estimation, M-estimation, and MLMHV seem to have comparable performance.
Note that the differences in overall performance between methods cannot
be explained simply by their performance at discontinuities or in low-textured 
areas. For example, consider the data for annealing and for graph cuts, shown in
figure 7. There is a substantial difference in performance at discontinuities and
in textureless regions, but most errors occur in other portions of the image.

4.2 Analysis of Ground-Truth Data

Our data contains some unexpected results. First of all, it is interesting that the 
different correlation-based methods are so similar in terms of their performance. 
In addition, there was surprisingly little variation as a function of window size, 
once the windows were sufficiently large. Finally, the overall performance of 
the correlation-based methods was disappointing, especially near discontinuities. 
Note that an algorithm that assigned every pixel a random disparity would be
within ±1 of the correct disparity approximately 20% of the time, and thus correct under our
definition.
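
The order of magnitude of this chance level can be checked with a short calculation. The number of candidate disparity levels is not stated here, so the value 15 used below is purely an assumption chosen for illustration, as is the function name `chance_within_tolerance`. The true and random disparities are both assumed to be uniform over the same integer range.

```python
def chance_within_tolerance(num_levels: int, tol: int = 1) -> float:
    """Probability that a uniformly random integer disparity is within tol of the truth."""
    hits = 0
    for true_d in range(num_levels):
        for guess in range(num_levels):
            if abs(guess - true_d) <= tol:
                hits += 1
    return hits / (num_levels * num_levels)


# With an assumed 15 disparity levels, an interior value has 3 acceptable
# guesses out of 15 (3/15 = 20%), and slightly fewer at the two ends.
rate = chance_within_tolerance(15)
```

Under that assumption the rate comes out slightly below one in five, which is in line with the approximate figure quoted above.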

It is commonly believed that it is important for matching algorithms to 
gracefully handle outliers. In terms of statistical robustness, the L2 distance is
the worst, followed by the L1 distance. M-estimation is better still, and least
median squares is best of all. There is some support for this argument in our 
Fig. 5. Performance of standard correlation-based methods as a function of window
radius, using the ground truth of figure 1. The graph at top shows errors for all pixels
(discarding those that are occlusions according to the ground truth). The graph at 
bottom only considers pixels that are discontinuities according to the ground truth. 



data, but it is not clear cut. The L1 distance has a small advantage over the L2
distance, and M-estimation has a slightly larger advantage over the L1 distance.
Least median squares does quite badly (although to the naked eye it looks fairly 
good, especially with small windows). The regions where it makes the most 
Fig. 6. Performance of global optimization methods as a function of running time,
using the Potts model energy on the imagery of figure 1.
Fig. 7. Performance comparison, using the best results for each algorithm, on the
imagery of figure 1.
Fig. 9. Performance of correlation methods as a function of window radius, using
prediction error.



mistakes are the low-texture regions, where it appears that the least median 
squares algorithm treats useful textured pixels (e.g., the bolts on the front of the 
workbench) as outliers. 
4.3 Analysis of Prediction Error Data

The prediction error metrics for various algorithms are shown in figures 8 and
9. Figure 8 shows the RMS (uncorrected) prediction error as a function of frame
number for four different algorithms and two versions of the ground truth data. 
The frames being tested are the ones in the middle row of the 5x5 University of 
Tsukuba data set (we will call these images LLL, LL, L, R, and RR in subsequent 
discussions). The ground truth data was modified to give sub-pixel estimates 
using a sub-pixel accurate correlation-based technique whose search range was 
constrained to stay within a ½ pixel disparity of the original ground truth. As
we can see in figure 8, this reduces the RMS prediction error by about 2 gray
levels. 

We can also see from this figure that error increases monotonically away from 
the reference frame 3 (the left image), and that prediction errors are worse when
moving leftward. This is because errors in the depth map due to occlusions in 
the right image are more visible when moving leftward (these areas are being 
exposed, rather than occluded). It is also interesting that the graph cut, mean 
field, and annealing approaches have very similar prediction errors, even though 
their ground truth errors differ. Our current conjecture is that this is because 
graph cuts do a better job of estimating disparity in textureless regions, which 
is not as important for the prediction task. 

Figure 9 shows the prediction error as a function of window size for the
four correlation-based algorithms we studied. These figures also suggest that a
window size of 7 is sufficient if prediction error is being used as a metric. The
shape of these curves is very similar to the ground truth error (figure 5), which
suggests that the two metrics are producing consistent results. 

4.4 Results on the Microsoft Research Imagery 

The results from running different variants of correlation on the imagery of 
figure 2 are shown in figure 11. Selected output images are given in figure 10. The
overall curves are quite consistent with the results from the Tsukuba imagery. 
Here, the least median squares algorithm does slightly better than the other
techniques. This is probably because there are no low-texture regions in this 
dataset. 

5 Discussion 

In this paper, we have compared two methodologies for evaluating stereo match- 
ing algorithms, and also compared the performance of several widely used stereo 
algorithms. The two methodologies produce different, but somewhat consistent 
results, while emphasizing (or de-emphasizing) certain kinds of errors. 

The ground truth methodology gives the best possible evaluation of a stereo 
matcher’s quality, since it supposedly knows what the perfect result (“gold
standard”) should be. However, it is possible for the ground truth to be inaccurate,
and it typically is so near discontinuities where pixels are mixtures of values 

Fig. 10. Results for correlation algorithms (r = 4), on the imagery of figure Errors
> ±1 are shown in black, while errors of ±1 are shown in gray. 



from different surfaces. Quantization to the nearest integer disparity is a further 
source of error, which we compensate for by only counting errors > ±1 disparity. 
Ground truth also weights regions such as textureless or occluded regions (where 
it is very difficult, if not impossible, to get a reliable result) equally with regions 
where all algorithms should perform well. In our experiments, we have deliber- 
ately excluded occluded regions from our analysis. It may be desirable to treat 
these regions on the same footing as other potential problem areas (textureless 
regions and discontinuities). Breaking down the error statistics by their location
in the image is a step towards trying to rationalize this situation.
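
As a minimal sketch of the ground truth metric described above, the following Python function counts disparity errors larger than ±1 while skipping occluded pixels. The function name, the nested-list image representation, and the boolean occlusion mask are assumptions made for illustration; they are not taken from the paper.

```python
def disparity_error_rate(estimated, ground_truth, occluded, tolerance=1):
    """Fraction of non-occluded pixels whose disparity error exceeds `tolerance`.

    All three arguments are lists of rows with identical shape (an assumed
    representation). `occluded` holds booleans marking pixels to be ignored.
    """
    bad = 0
    counted = 0
    for est_row, gt_row, occ_row in zip(estimated, ground_truth, occluded):
        for est, gt, occ in zip(est_row, gt_row, occ_row):
            if occ:
                continue  # occluded pixels are excluded from the analysis
            counted += 1
            if abs(est - gt) > tolerance:
                bad += 1  # only errors larger than the tolerance are counted
    return bad / counted if counted else 0.0
```

Restricting the loop to pixels flagged as discontinuities or textureless regions (with a second mask) would give the per-location breakdown mentioned in the text.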

Intensity prediction error is a different metric, which de-emphasizes errors in 
low-texture areas, but emphasizes small (one pixel or sub-pixel) errors in highly 
textured areas. The former is a reasonable idea if the stereo maps are going to 
be used in an image-based rendering application. Those regions where the depth 
estimates are unreliable due to low texture are also regions where the errors are 
less visible. The problem with sub-pixel errors should be fixable by modifying or 
extending the algorithms being evaluated to return sub-pixel accurate disparity 
estimates. 

A third methodology, which we have not yet evaluated, is the self-consistency 
metric of Leclerc et al. [E|. In this methodology, the consistency in 3D location 
(or re-projected pixel coordinates) of reconstructed 3D points from different 
pairs of images is calculated. This shares some characteristics with the intensity 
prediction error metric used in this paper, in that more than two images are used
to perform the evaluation. However, this metric is more stringent than intensity 
prediction. In low texture areas where the results tend to be error-prone, it is 
unlikely that the self-consistency will be good (whereas intensity prediction may 
be good). There is a possibility that independently run stereo matchers may 
accidentally produce consistent results, but this seems unlikely in practice. In 
the future, we hope to collaborate with the authors of uni to apply our different 
methodologies to the same sets of data. 



5.1 Extensions 

In work to date, we have already obtained interesting results about the relative 
performance of various algorithms and their sensitivity to certain parameters 

Fig. 11. Performance of standard correlation-based methods as a function of window
radius, using the ground truth of figure |21 The graph at top shows errors for all pixels 
(discarding those that are occlusions according to the ground truth). The graph at 
bottom only considers pixels that are discontinuities according to the ground truth. 

(such as window size). However, there are many additional issues and questions
that we are planning to examine in ongoing work. These issues include: 

— Sub-pixel issues and sampling error: investigate the effects of using a finer 
set of (sub-pixel) disparities on both the ground truth and prediction error 
metrics. 

— Study more algorithms, including minimization with non-Potts energy and 
the use of weighted windows for correlation. 

— Evaluate more data sets. 

— Determine whether it is more important to come up with the correct energy 
to minimize, or whether it is more important to find a good minimum. 

— Investigate the sensitivity of algorithms to various parameter values. 

— Study whether cross-validation (using prediction error in a multi-image stereo 
dataset) can be used to fine-tune algorithm parameters or to adapt them lo- 
cally across an image. 

We hope that our results on stereo matching will motivate others to perform 
careful quantitative evaluation of their stereo algorithms, and that our inquiries
will lead to a deeper understanding of the behavior (and failure modes) of stereo 
correspondence algorithms. 

Discussion

Jean Ponce: You only showed one data set, how much can you say from one 
data set? 

Ramin Zabih: Yes, that’s clearly the major weakness so far — there’s only 
one data set. The two different methodologies point in the same direction. It 
might still be that there’s something strange about the way the ground truth 
was presented, but I think that the agreement is encouraging. So far the state 
of the art has been just to show off different pictures. We’re trying to move 
beyond that, but getting ground truth for the ground-truth methodology is very 
difficult. At the least the prediction error stuff allows you to work on lots of 
different data sets, which we’re in the process of doing. 

Jean Ponce: Yes, but prediction error is so application dependent. It works very 
well for image-based rendering, but if you want to do navigation or whatever, 
then it’s not really appropriate. 

Ramin Zabih: Yes, I agree. 

Yvan Leclerc: Yesterday I talked about our self-consistency methodology for 
comparing stereo algorithms. I was wondering if you could comment on the 
relationship between your approach and ours. 

Rick Szeliski: For those who missed Yvan’s talk, his technique is similar in that 
you start with a multi-image data set. But instead of what we call prediction 
error, where you take a depth map computed with one image pair and predict 
the appearance in all the other images, Yvan’s methodology computes depth 
maps between all possible pairs, and then sees whether they’re consistently pre- 
dicting the same 3D point. I think that it’s a very valid methodology. What 
we hope to do is to test a wider range of data sets with more algorithms, and 
eventually publish a survey paper, along the lines of the kind of comparative 
work that we see already in motion estimation. I think it will be essential to 
include Yvan’s methodology as well. Hopefully we’ll be able to work out some 
sort of a joint evaluation. The two metrics won’t necessarily give the same re- 
sults — appearance prediction is oriented towards image-based rendering and 
is tolerant of errors in low texture regions, whereas Yvan’s method is oriented 
towards structure and might heavily penalize those. Jean’s comment is very well 
taken — this is application-dependent. But you know, in computer vision we’ve 
worked on robotics, robotics, robotics. Even when we stopped working on that 
we still kept the same mind set. But if you look for example at what happens 
with stereo algorithms when you try to do z-keying — you try to extract the 
foreground person from a textured background and put something synthetic be- 
hind him — the result is horrible, it’s just not acceptable, you get these spiky 
halos full of the wrong pixels. As Luc Robert commented yesterday, we can’t 
use computer vision yet in Hollywood. The reason is basically that we’re not 
focusing on the right problems. That’s why I like prediction error — it penalizes 
you heavily for those visible little single-pixel errors.

Yvan Leclerc: Combining our methodologies would be a great idea. Let’s do 
it. 

Ramin Zabih: One comment about Jean’s point is that in many situations
there do seem to be consistent differences between the algorithms. We don’t 
have enough ground truth to do convincing statistics yet, but it looks like the 
optimization-based approaches are doing better, certainly at discontinuities, and 
often in low texture areas as well. 

Rick Szeliski: One final comment. We have made these data sets available 
on http://www.research.microsoft.com/~szeliski/stereo/, so that people 
can run their stereo algorithms on them. We are interested in hearing about the 
results, as we intend to publish a comparative survey of the performance of the 
different methods we have access to. 

Abstract. Popular algorithms for feature matching and model extraction
fall into two broad categories, generate-and-test and Hough transform
variations. However, both methods suffer from problems in practical
implementations. Generate-and-test methods are sensitive to noise
in the data. They often fail when the generated model fit is poor due to 
error in the selected features. Hough transform variations are somewhat 
less sensitive to noise, but implementations for complex problems suffer
from large time and space requirements and the detection of false posi- 
tives. This paper describes a general method for solving problems where 
a model is extracted from or fit to data that draws benefits from both 
generate-and-test methods and those based on the Hough transform, 
yielding a method superior to both. An important component of the 
method is the subdivision of the problem into many subproblems. This 
allows efficient generate-and-test techniques to be used, including the use 
of randomization to limit the number of subproblems that must be ex- 
amined. However, the subproblems are solved using pose space analysis 
techniques similar to the Hough transform, which lowers the sensitivity 
of the method to noise. This strategy is easy to implement and results in 
practical algorithms that are efficient and robust. We apply this method 
to object recognition, geometric primitive extraction, robust regression, 
and motion segmentation. 



1 Introduction 

The generate-and-test paradigm is a popular strategy for solving model match- 
ing problems such as recognition, detection, and fitting. The basic idea of this 
method is to generate (or predict) many hypothetical model positions using the 
minimal amount of information necessary to identify unique solutions. A se- 
quence of such positions is tested, and the positions that meet some criterion 
are retained. Examples of this technique include RANSAC jOj and the alignment 
method m 

The primary drawback to the generate-and-test paradigm is sensitivity to noise.
Let us call the features that are used in predicting the model position for some 
test the distinguished features, since they play a more important role in whether 
the test is successful. The other features are undistinguished features. Error in 
the distinguished features causes the predicted position to be in error. As the
error grows, the testing step becomes more likely to fail. 

To deal with this problem, methods have been developed to propagate errors 
in the locations of the distinguished features |1I8| . Under the assumption of a 
bounded error region for each of the distinguished image features, these methods 
can place bounds on the locations at which the undistinguished model features
can appear in an image. When we count the number of undistinguished
model features that can be aligned with image features (with the constraint 
that the distinguished features must always be in alignment up to the error 
bounds) these techniques can guarantee that we never undercount the number 
of alignable features. The techniques will thus never report that the model is 
not present according to some counting criterion when, in fact, the model does 
meet the criterion. 

On the other hand, this method is likely to overcount the number of alignable 
features, even if the bounds on the location of each individual feature are tight. 
The reason for this is that, while this method checks whether there is a model 
position that brings each of the undistinguished model features into alignment 
with image features (along with all of the distinguished features) up to the error 
bounds, it does not check whether there is a position that brings all of the 
counted undistinguished features into alignment up to the error bounds. 

A competing technique for feature matching and model extraction is based 
on the Hough transform. This method also generates hypothetical model positions
using minimal information, but rather than testing each solution
separately, the testing is performed by analyzing the locations of the solutions in 
the space of possible model positions (or poses). This is often, but not always, ac- 
complished through a histogramming or clustering procedure. The large clusters 
in the pose space indicate good model fits. We call techniques that examine the 
pose space for sets of consistent matches among all hypothetical matches Hough- 
based methods, since they derive from the Hough transform II 111 51 . While these 
techniques are less sensitive to noise in the features, they are prone to large com- 
putational and memory requirements, as well as the detection of false positive 
instances [7], if the pose space analysis is not careful.

In this paper, we describe a technique that combines the generate-and-test 
and Hough-based methods in a way that draws ideas and advantages from each, 
yielding a method that improves upon both. Like the generate-and-test method, 
(partial) solutions based on distinguished features are generated for further ex- 
amination. However, each such solution is under-constrained and Hough-based 
methods are used to determine and evaluate the remainder of the solution. This 
allows both randomization to be used to reduce the computational complexity 
of the method and error propagation techniques to be used in order to better ex- 
tract the relevant models. We call this technique RUDR (pronounced “rudder”), 
for Recognition Using Decomposition and Randomization. 

First, it is shown that the problem can be treated as many subproblems, each 
of which is much simpler than the original problem. We next discuss various 
methods by which the subproblems can be solved. The application of randomization
to reduce the number of subproblems that must be examined is then
described. These techniques yield efficiency gains over conventional generate- 
and-test and Hough-based methods. In addition, the subdivision of the problem 
allows us to examine a much smaller parameter space in each of the subproblems 
than in the original problem and this allows the error inherent in localization 
procedures to be propagated accurately and efficiently in the matching process. 

This method has a large number of applications. It can be applied to essen- 
tially any problem where a model is fit to cluttered data (i.e. with outliers or 
multiple models present). We discuss the application of this method to object 
recognition, curve detection, robust regression, and motion segmentation. 

The work described here is a generalization of previous work on feature 
matching and model extraction. Similar ideas have been used by other
researchers. A simple variation of this method has been applied to curve detec- 
tion by Murakami et al. m and Leavers In both of these cases, the problem 
decomposition was achieved through the use of a single distinguished feature in 
the image for each of the subproblems. We argue that the optimal performance 
is achieved when the number of distinguished features is one less than the num- 
ber necessary to fully define the model position in the errorless case. This has 
two beneficial effects. First, it reduces the amount of the pose space that must 
be considered in each problem (and the combinatorial explosion in the sets of 
undistinguished features that are examined). Second, it allows a more effective 
use of randomization in reducing the computational complexity of the method. 
A closely related decomposition and randomization method has been described 
by Cass 0 in the context of pose equivalence analysis. He uses a base match to 
develop an approximation algorithm for feature matching under uncertainty. 



2 General Problem Formalization 

The class of problems that we attack using RUDR are those that require a model 
to be fit to a set of observed data features, where a significant portion of the 
observed data may be outliers or there may be multiple models present in the 
data. These problems can, in general, be formalized as follows. 

Given: 

• $\mathcal{M}$: The model to be fit. This model may be a set of distinct features as is
typical in object recognition, or it may be a parameterized manifold such as a
curve or surface, as in geometric primitive extraction and robust regression.

• $\mathcal{D}$: The data to match. This data consists of a set of features or measurements,
$\{\delta_1, \ldots, \delta_d\}$, that have been extracted, for example, from an image.

• $\mathcal{T}$: The possible positions or transformations of the model. We use $\tau$ to denote
individual transformations in this space.

• $A(\mathcal{M}, \mathcal{D}, \mathcal{T}, \tau, D)$: A binary-valued acceptance criterion that specifies whether
a transformation, $\tau$, satisfactorily brings the model into agreement with a set of
data features, $D \subseteq \mathcal{D}$. We allow this criterion to be a function of the full set of
data features and the set of transformations to allow the criterion to select the
single best subset of data features according to some criterion or to take into
account global matching information.

Determine and report:

• All maximal sets of data features, $D \subseteq \mathcal{D}$, for which there is a transformation,
$\tau \in \mathcal{T}$, such that the acceptance criterion, $A(\mathcal{M}, \mathcal{D}, \mathcal{T}, \tau, D)$, is satisfied.

This formalization is very general. Many problems can be formalized in this 
manner, including object recognition, geometric primitive extraction, motion 
segmentation, and robust regression. 

A useful acceptance criterion is based on bounding the fitting error between 
the model and the data. Let $C(\mathcal{M}, \delta, \tau)$ be a function that determines whether
the specified position of the model fits the data feature $\delta$ (e.g. up to a bounded
error). We let $C(\mathcal{M}, \delta, \tau) = 1$ if the criterion is satisfied, and $C(\mathcal{M}, \delta, \tau) = 0$
otherwise. The model is said to be brought into alignment with a set of data
features, $D = \{\delta_1, \ldots, \delta_x\}$, up to the error criterion, if all of the individual features
are brought into alignment:

$$\prod_{i=1}^{x} C(\mathcal{M}, \delta_i, \tau) = 1 \qquad (1)$$

The bounded-error acceptance criterion specifies that a set of data features,
$D = \{\delta_1, \ldots, \delta_x\}$, should be reported if the cardinality of the set meets some
threshold ($x \geq c$), there is a position of the model that satisfies (1), and the set
is not a subset of some larger set that is reported. 

While this criterion cannot incorporate global information, such as mean- 
square-error or least-median-of-squares, RUDR is not restricted to using this 
bounded-error criterion. This method has been applied to least-median-of-squares 
regression with excellent results m- 



Example As a running example, we will consider the detection of circles in
two-dimensional image data. For this case, our model, $\mathcal{M}$, is simply the
parameterization of a circle, $(x - x_c)^2 + (y - y_c)^2 = r^2$, and our data, $\mathcal{D}$, is a set
of image points. The space of possible transformations is the space of circles,
$\tau = [x_c, y_c, r]^T$. We use a bounded-error acceptance criterion such that a point
is considered to be on the circle if

$$\left| \sqrt{(x - x_c)^2 + (y - y_c)^2} - r \right| < \epsilon.$$

We will report the circles that have $\sum_{i=1}^{d} C(\mathcal{M}, \delta_i, \tau) \geq \pi r$. In other words, we search
for the circles that have half of their perimeter present in the image.
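
The bounded-error criterion for the circle example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, points are (x, y) tuples, a circle is an (xc, yc, r) tuple, and the reporting threshold of πr is treated as inclusive.

```python
import math

def on_circle(point, circle, eps):
    """C(M, delta, tau): True if the point lies within eps of the circle."""
    x, y = point
    xc, yc, r = circle
    return abs(math.hypot(x - xc, y - yc) - r) < eps

def count_aligned(points, circle, eps):
    """Number of data points brought into alignment by this circle."""
    return sum(1 for p in points if on_circle(p, circle, eps))

def should_report(points, circle, eps):
    """Report the circle if roughly half of its perimeter is present.

    The perimeter is 2*pi*r, so half of it corresponds to pi*r points
    (assuming about one image point per unit of arc length).
    """
    return count_aligned(points, circle, eps) >= math.pi * circle[2]
```

For a circle of radius 5, for instance, at least 16 aligned points are needed, since π · 5 ≈ 15.7.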



3 Approach 

Let us call the hypothetical correspondence between a set of data features and 
the model a matching. The generate-and-test paradigm and many Hough-based 
strategies solve for hypothetical model positions using matchings of the minimum 
cardinality to constrain the model position up to a finite ambiguity (assuming 
errorless features). We call the matchings that contain this minimal amount
of information the minimal matchings and we denote their cardinality k. We 
consider two types of models. One type of model consists of a set of discrete 
features similar to the data features. The other is a parameterized model such 
as a curve or surface. When the model is a set of discrete features, the minimal 
matchings specify the model features that match each of the data features in 
the minimal matching and we call these explicit matchings. Otherwise, the data 
features are matched implicitly to the parameterized model and we thus call 
these implicit matchings. 

In the generate-and-test paradigm, the model positions generated using the 
minimal matchings are tested by determining how well the undistinguished fea- 
tures are fit according to the predicted model position. In Hough-based methods, 
it is typical to determine the positions of the model that align each of the min- 
imal matchings and detect clusters of these positions in the parameter space 
that describes the set of possible model positions, but other pose space analysis 
techniques can be used (e.g. m)- 

The approach that we take draws upon both generate-and-test techniques 
and Hough-based techniques. The underlying matching method may be any one 
of several pose space analysis techniques in the Hough-based method (see Section 
4), but unlike previous Hough-based methods, the problem is subdivided into 
many smaller problems, in which only a subset of the minimal matchings is 
examined. When randomization is applied to selecting which subproblems to 
solve, a low computational complexity can be achieved with a low probability of 
failure. 

The key to this method is to subdivide the problem into many small sub- 
problems, in which a distinguished matching of some cardinality g < k between 
data features and the model is considered. Only those minimal matchings that 
contain the distinguished matching are examined in each subproblem and this 
constrains the portion of the pose space that the subproblem considers. We could 
consider each possible distinguished matching of the appropriate cardinality as 
a subproblem, but we shall see that this is not necessary in practice. 

Let’s consider the effect of this decomposition of the problem on the matchings
that are detected by a system using a bounded-error criterion, $C(\mathcal{M}, \delta, \tau)$,
as described above. For now, we assume that we have some method of deter- 
mining precisely those sets of data features that should be reported according to 
the bounded-error acceptance criterion. The implications of performing match- 
ing only approximately and the use of an acceptance criterion other than the 
bounded-error criterion are discussed subsequently. 



Proposition 1. For any transformation, $\tau \in \mathcal{T}$, the following statements are
equivalent:

1. Transformation $\tau$ brings at least $x$ data features into alignment with the model
up to the error criterion.

2. Transformation $\tau$ brings at least $\binom{x}{k}$ sets of data features with cardinality $k$
into alignment with the model up to the error criterion.

3. For any distinguished matching of cardinality $g$ that is brought into alignment
with the model up to the error criterion by $\tau$, there are $\binom{x-g}{k-g}$ minimal matchings
that contain the distinguished matching that are brought into alignment up to the
error criterion by $\tau$.

The proof of this proposition, which follows directly from combinatorics, 
is sketched in m- This result indicates that as long as we examine one distin- 
guished matching that belongs to each of the matchings that should be reported, 
the strategy of subdividing the problem into subproblems yields equivalent re- 
sults to examining the original problem as long as the threshold on the number 
of matches is set appropriately. 
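
The counting argument behind the proposition can be checked by brute force on a small case. The sketch below is an illustration only: it assumes the aligned features are given as a list of hashable labels, and the function name is hypothetical.

```python
from itertools import combinations
from math import comb

def count_minimal_matchings(aligned, k, distinguished=()):
    """Count the size-k subsets of `aligned` that contain `distinguished`.

    `aligned` stands for the x features brought into alignment by one
    transformation; `distinguished` is a subset of it with cardinality g.
    """
    required = set(distinguished)
    return sum(1 for m in combinations(aligned, k) if required <= set(m))
```

With x = 7 aligned features and k = 3, the count is C(7, 3) = 35 without a distinguished matching, and C(5, 1) = 5 when a distinguished matching of cardinality g = 2 must be contained. This is why the vote threshold inside a subproblem has to be lowered accordingly.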

This decomposition of the problem allows our method to be viewed as a 
class of generate-and-test methods, where distinguished matchings (rather than 
minimal matchings) are generated and the testing step is performed using a pose 
space analysis method (such as clustering or pose space equivalence analysis) 
rather than comparing a particular model position against the data. 

While distinguished matchings of any cardinality could be considered, we 
must balance the complexity of the subproblems with the number of subproblems 
that are examined. Increasing the cardinality of the distinguished matching is 
beneficial up to a point. As the size of the distinguished matching is increased, the 
number of minimal matchings that is examined in each subproblem is decreased 
and we have more constraint on the position of the model. The subproblems are 
thus simpler to solve. By itself, this does not improve matters, since there are 
more subproblems to examine. However, since we use randomization to limit the 
number of subproblems that are examined, we can achieve a lower computational 
complexity by having more simple subproblems than fewer difficult ones. On the 
other hand, when we reach g = k, the method becomes equivalent to a generate- 
and-test technique and we lose both the benefits gained through the Hough-based 
analysis of the pose space and the property that the subproblems become simpler 
with larger distinguished matchings. We thus use distinguished matchings with 
cardinality g = k — 1. 

Now, for practical reasons, we may not wish to use an algorithm that reports 
exactly those matchings that satisfy the error criterion, since such algorithms 
are often time consuming. In this case, we cannot guarantee that examining a 
distinguished matching that belongs to a solution that should be reported will 
result in detecting that solution. However, empirical evidence suggests that the 
examination of these subproblems yields superior results when an approximation 
algorithm is used, owing to failures that occur in the examination of the full
problem.

We can also use these techniques with acceptance criteria other than the 
bounded-error criterion. With other criteria, the proposition is no longer always 
true, but if an approximation algorithm is used to detect good matchings, exam- 
ination of the subproblems often yields good results. For example, an application 
of these ideas to least-median-of-squares regression has yielded an approxima- 
tion algorithm that is provably accurate with high probability, while previous 
approximation algorithms do not have this property m- 

Example For our circle detection example, $k = 3$, since three points are sufficient
to define a circle in the noiseless case. The above analysis implies that,
rather than examining individual image features, or all triples of features, we
should examine trials (or subproblems) in which only the triples that share some
distinguished pair of features are considered. Multiple trials are examined to guard
against missing a circle.



4 Solving the Subproblems 

Now, we must use some method to solve each of the subproblems that are
examined. We can use any method that determines how many matchings of a
given cardinality can be brought approximately into alignment with the model at
a particular position. The simplest method is one that uses a multi-dimensional
histogramming step in order to locate large clusters in the pose space. This 
method can be implemented efficiently in both time and space m- However, 
errors in the data cause the clusters to spread in a manner that can be difficult to 
handle using this technique. For complex problems, it can become problematic 
to detect the clusters without also detecting a significant number of false posi- 
tives [Z|- Alternatively, recently developed pose equivalence analysis techniques 
developed by Breuel Pj and Cass P] can be applied that allow localization error 
to be propagated accurately. Breuel’s experiments indicate that his techniques 
can operate in linear expected time in the number of matchings, so we can, in 
general, perform this step efficiently. 

In our method, only a small portion of the parameter space is examined in 
each subproblem. If it is assumed that there is no error in the data features in 
the distinguished matching, then each subproblem considers only a sub-manifold 
of the parameter space. In general, if there are $p$ transformation parameters and
each feature match yields $b$ constraints on the transformation, then a subproblem
where the distinguished matchings have cardinality $g$ considers only a
$(p - gb)$-dimensional manifold of the transformation space in the errorless case. This
allows us to parameterize the sub-manifold (using $p - gb$ parameters) and perform
analysis in this lower dimensional space. A particularly useful case is when the 
resulting manifold has only one dimension (i.e. it is a curve). In this case, the 
subproblem can be solved very simply by parameterizing the curve and finding 
positions on the curve that are consistent with many minimal matchings. 

When localization error in the data features is considered, the subproblems 
must (at least implicitly) consider a larger space than the manifold described 
above. The subproblems are still much easier to solve. A technique that is useful 
in this case is to project the set of transformations that are consistent with a 
minimal matching up to the error criterion onto the manifold that results in the 
errorless case and then perform clustering only in the parameterization of this 
manifold as discussed above m 

Example For circle detection, the circle positions that share a pair of points lie on 
a curve in the pose space. (The center of the circle is always on the perpendicular 
bisector of the two distinguished points.) We parameterize the positions using
the signed distance d from the center of the circle to the midpoint between the 
distinguished points (positive if above, negative if below). This yields a unique 
descriptor for every circle containing the distinguished points. For each triple 
that is considered, we can project the pose space consistent with the triple onto 
the parameterization by considering which centers are possible given some error 
bounds on the point locations ^21 • We determine if a circle is present in each 
trial by finely discretizing d and performing a simple Hough transform variation, 
where the counter for each bin is incremented for each triple that is consistent 
with the span represented by the counter. Peaks in the accumulator are accepted 
if they surpass some predetermined threshold. 
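
A single circle-detection trial of this kind might be sketched as follows. This is a simplified illustration, not the implementation used in the paper: each triple votes for one bin of d rather than for the whole span of d consistent with the error bounds, the sign convention for d is fixed by the chosen normal direction, and the function name, bin width, and vote threshold are assumptions.

```python
import math
from collections import defaultdict

def solve_circle_subproblem(points, p1, p2, bin_width=0.1, min_votes=3):
    """One trial: vote over circles passing through the distinguished pair.

    Returns (xc, yc, r, votes) for the strongest peak, or None if no bin
    reaches `min_votes`. Points are (x, y) tuples.
    """
    mx, my = (p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return None  # the distinguished points must be distinct
    ux, uy = dx / length, dy / length   # unit vector along the chord
    nx, ny = -uy, ux                    # unit normal: direction of the bisector
    h = length / 2.0                    # half the chord length
    bins = defaultdict(list)
    for point in points:
        if point == p1 or point == p2:
            continue
        a = (point[0] - mx) * ux + (point[1] - my) * uy
        b = (point[0] - mx) * nx + (point[1] - my) * ny
        if abs(b) < 1e-9:
            continue  # (nearly) collinear triple: no finite circle
        # Signed distance from the chord midpoint to the circle center,
        # obtained from |c - p1|^2 = |c - p3|^2 with c = midpoint + d * n.
        d = (a * a + b * b - h * h) / (2.0 * b)
        bins[round(d / bin_width)].append(d)
    if not bins:
        return None
    best = max(bins.values(), key=len)
    if len(best) < min_votes:
        return None
    d = sum(best) / len(best)  # refine the peak by averaging its votes
    return (mx + d * nx, my + d * ny, math.hypot(h, d), len(best))
```

Because the accumulator is one-dimensional, the cost of a trial is linear in the number of image points, which matches the per-trial cost assumed in the complexity analysis.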



5 Randomization and Complexity 



A deterministic implementation of these ideas examines each possible distinguished
matching with the appropriate cardinality. This requires $O(n^k)$ time,
where n is the number of possible matches between a data feature and the 
model. When explicit matchings are considered, n = md, where m is the number 
of model features and d is the number of data features. When implicit match- 
ings are considered, n = d. Such a deterministic implementation performs much 
redundant work. There are many distinguished matchings that are part of each 
of the large consistent matchings that we are seeking. We thus find each match- 
ing that meets the acceptance criterion many times (once for each distinguished 
matching that is contained in the maximal matching). We can take advantage 
of this redundancy through the use of a common randomization technique to 
limit the number of subproblems that we must consider while maintaining a low 
probability of failure. 

Assume that some minimum number of the image features belong to the 
model. Denote this number by b. Since our usual acceptance criterion is based 
on counting the number of image features that belong to the model, we can 
allow the procedure to fail when too few image features belong to the model. 
Otherwise, the probability that some set of image features with cardinality
$g = k - 1$ completely belongs to the model is approximately bounded by
$(b/d)^{k-1}$.

If we take t trials that select sets of $k - 1$ image features randomly, then the
probability that none of them will completely belong to the model is:

$$p_t = \left(1 - \left(\frac{b}{d}\right)^{k-1}\right)^t$$

Setting this probability below some arbitrarily small threshold ($p_t < \gamma$) yields:

$$t_{\min} = \frac{\ln \gamma}{\ln\left(1 - \left(\frac{b}{d}\right)^{k-1}\right)} \approx \left(\frac{d}{b}\right)^{k-1} \ln\frac{1}{\gamma} \qquad (3)$$
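
The approximation in (3) follows from the first-order expansion $\ln(1 - x) \approx -x$ for small x. A short derivation, writing $x = (b/d)^{k-1}$ (the symbol x is introduced here only for brevity):

```latex
\begin{align*}
p_t = (1 - x)^t < \gamma
  &\iff t \ln(1 - x) < \ln\gamma \\
  &\iff t > \frac{\ln\gamma}{\ln(1 - x)}
     \quad \text{(since } \ln(1 - x) < 0\text{)} \\
\frac{\ln\gamma}{\ln(1 - x)}
  &\approx \frac{\ln\gamma}{-x}
   = \frac{1}{x}\ln\frac{1}{\gamma}
   = \left(\frac{d}{b}\right)^{k-1}\ln\frac{1}{\gamma}.
\end{align*}
```

Since $-\ln(1 - x) > x$ for $0 < x < 1$, the approximate value slightly overestimates the exact one, so using it errs on the side of more trials.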



Now, for explicit matches, we assume that some minimum fraction $f_e$ of the
model features appear in the image. In this case, the number of trials necessary
is approximately $\left(\frac{d}{f_e m}\right)^{k-1} \ln\frac{1}{\gamma}$. For each trial, we must consider matching the
set of image features against each possibly matching set of model features, so the
total number of distinguished matchings that are considered is approximately
$\left(\frac{d}{f_e}\right)^{k-1} (k-1)! \ln\frac{1}{\gamma}$. Each explicit distinguished matching requires $O(md)$ time
to process, so the overall time required is $O(md^k)$.

For implicit matches, we may assume that each significant model in the image
comprises some minimum fraction $f_i$ of the image features. The number of trials
necessary to achieve a probability of failure below $\gamma$ is approximately $f_i^{1-k} \ln\frac{1}{\gamma}$,
which is a constant independent of the number of model or image features. Since
each trial can be solved in $O(d)$ time, the overall time required is $O(d)$.

Note that the complexity can be reduced further by performing subsampling 
among the matchings considered in each trial. Indeed, $O(1)$ complexity is possible
with some assumptions about the number of features present and the rate of
errors allowable [2]. We have not found this further complexity reduction to be
necessary in our experiments. However, it may be useful when the number of 
image features is very large. 



Example. Our circle detection case uses implicit matchings. If we assume that
each circle that we wish to detect comprises at least $f_i = 5\%$ of the image data
and require that the probability of failure is below $\gamma = 0.1\%$, then the number of
trials necessary is 2764. Each trial considers the remaining $d - 2$ image features.
Note that techniques considering all triples will surpass the number of triples
considered here when d > 53.
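
The trial count in this example can be checked numerically. The sketch below evaluates both the exact expression, ln γ / ln(1 − f^(k−1)), and the approximation f^(1−k) ln(1/γ), assuming k = 3 for circles so that each trial fixes k − 1 = 2 distinguished points. The function names are ours and are not taken from the paper.

```python
import math

def trials_exact(fraction, k, gamma):
    """Exact expression ln(gamma) / ln(1 - fraction**(k-1)), rounded up."""
    return math.ceil(math.log(gamma) / math.log(1.0 - fraction ** (k - 1)))

def trials_approx(fraction, k, gamma):
    """Approximation fraction**(1-k) * ln(1/gamma), rounded up."""
    return math.ceil(fraction ** (1 - k) * math.log(1.0 / gamma))

# Circle detection: k = 3, circles cover at least 5% of the data,
# probability of failure below 0.1%.
print(trials_approx(0.05, 3, 0.001))
print(trials_exact(0.05, 3, 0.001))
```

The approximation reproduces the 2764 trials quoted in the example; the exact expression gives a slightly smaller count.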



6 Comparison with Other Techniques 

This section gives a comparison of the RUDR approach with previous generate- 
and-test and Hough-based techniques. 

Deterministic generate-and-test techniques require $O(n^{k+1})$ time to perform
model extraction in general, since there are $O(n^k)$ minimal matchings and the
testing stage can be implemented in $O(n)$ time. This can often be reduced slightly
through the use of efficient geometric searching techniques during the testing
stage (e.g., [?]). RUDR yields a superior computational complexity requirement
for this case. When randomization is applied to generate-and-test techniques, the
computational complexity becomes $O(md^{k+1})$ (or slightly better using efficient ge-
ometric search) for explicit matches and $O(d)$ for implicit matches. RUDR yields
a superior computational complexity for the case of explicit matches and, while
the generate-and-test approach matches the complexity for the case of implicit
matches, RUDR examines fewer subproblems by a constant factor (approximately
$1/f_i$) and is thus faster in practice.

In addition, previous generate-and-test techniques are inherently less precise 
in the propagation of localization error. The basic generate-and-test algorithm
introduces false negatives unless care is taken to propagate the errors correctly
[?], since error in the data features leads to error in the hypothetical model
pose and this error causes some of the models to be missed as a result of a poor
fit. A more serious problem is that, while the generate-and-test techniques that 
propagate errors correctly ensure that each of the undistinguished features can 
be separately brought into alignment (along with the distinguished set) up to 
some error bounds by a single model position, this position may be different for 
each such feature match. It does not guarantee that all of the features can be 
brought into alignment up to the error bounds by a single position and thus 
causes false positives to be found. 

Hough-based methods are capable of propagating localization error such that
neither false positives nor false negatives occur (in the sense that only match-
ings meeting the acceptance criterion are reported). However, previous
Hough-based methods have had large time and space requirements. Determinis-
tic Hough-based techniques that examine minimal matchings require $O(n^k)$ time
and considerable memory [20].

Randomization has been previously applied to Hough transform techniques.
However, in previous methods, randomization has been used in a
different manner than it is used here. While RUDR examines all of the data in 
each of the subproblems, previous uses of randomization in Hough-based meth- 
ods subsample the overall data examined, causing both false positives and false 
negatives to occur as a result. While false negatives can occur due to the use 
of randomization in the RUDR approach, the probability of such an occurrence 
can be set arbitrarily low. 

Our method draws the ability to propagate localization error accurately from 
Hough-based methods and combines it with the ability to subdivide the problem 
into many smaller subproblems and thus reap the full benefit of randomization 
techniques. The result is a model extraction algorithm with superior computa- 
tional complexity to previous methods that is also robust with respect to false 
positives and false negatives. 

All of the techniques considered so far have been model-based methods. The 
primary drawback to such techniques is a combinatorial complexity that is poly- 
nomial in the number of features, but exponential in the complexity of the pose 
space (as measured by k). This can be subverted in some cases by assuming that 
some fraction of the data features arises from the model (this shifts the base 
of the exponent to the required fraction). An alternative that can be useful in 
reducing this problem is the use of grouping or perceptual organization methods 
that use data-driven techniques to determine features that are likely to belong 
to the same model (for example, [12, 17]). In cases where models can be identi-
fied by purely data-driven methods, such techniques are likely to be faster than
the techniques described here. However, work has shown that even imperfect
feature grouping methods can improve both the complexity and the rate of false
positives in the RUDR method [?].

There are some situations where RUDR cannot be applied effectively. If a
single data feature is sufficient to constrain the position of the model, the RUDR
problem decomposition will not be useful. In addition, the techniques we describe
will be of less value when there is a small number of features in the image. In
this case, the randomization may not yield an improvement in the speed of the
algorithm. However, the error propagation benefits will still apply.



7 Applications of RUDR 

RUDR has been applied to several problems. We review the important aspects of 
these applications here and discuss additional areas where RUDR can be applied. 

7.1 Extraction of Geometric Primitives 

The Hough transform is a well known technique for geometric primitive extrac-
tion [11]. The application of RUDR to this method improves the efficiency
of the technique, allows the localization error to be propagated accurately, and
reduces the amount of memory that is required [22].

Consider the case of detecting curves from feature points in two-dimensional 
image data. If we wish to detect curves with p parameters, then we use distin- 
guished matchings consisting of p— 1 feature points, since, in general, p points are 
required to solve for the curve parameters. Each distinguished matching maps 
to a one-dimensional manifold (a curve) in the parameter space, if the points are 
errorless and in general position. Methods have been developed to map minimal 
matchings with bounded errors into segments of this curve for the case of lines 
and circles [?]. $O(d)$ time and space are required for curve detection with these
techniques, where d is the number of data points extracted from the image. 

Figure 1 shows the results of using RUDR to detect circles in a binary image
of an engineering drawing. The results are very good, with the exception of 
circles found with a low threshold that are not perceptually salient. However, 
these circles meet the acceptance criterion specified, so this is not a failure of 
the algorithm. 

The image in Figure 1 contains 9299 edge pixels. In order to detect circles
comprising 4% of the image, RUDR examines 4318 trials and considers
$4.01 \times 10^7$ triples. Contrast this to the $8.04 \times 10^{11}$ possible triples. A
generate-and-test technique using the same type of randomization examines
$1.08 \times 10^5$ trials ($1.00 \times 10^9$ triples) to achieve the same probability of examining a trial
where the distinguished features belong to some circle, but will still miss circles
due to the error in the features.
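
These counts can be reproduced from the trial-count approximation for implicit matchings. The sketch below rests on our reading of the numbers rather than on a statement in the text: the failure probability is taken to be γ = 0.1% as in the earlier example, RUDR fixes 2 distinguished points per trial and pairs them with each of the remaining d − 2 points, and randomized generate-and-test fixes 3 points per trial and tests them against the remaining d − 3 points. `work_estimate` is a hypothetical helper name.

```python
import math

def work_estimate(d, fraction, gamma, set_size):
    """Return (trials, triples examined) when each trial fixes `set_size`
    randomly chosen points and combines them with every remaining point."""
    trials = math.ceil(fraction ** (-set_size) * math.log(1.0 / gamma))
    return trials, trials * (d - set_size)

d, fraction, gamma = 9299, 0.04, 0.001
rudr = work_estimate(d, fraction, gamma, set_size=2)       # distinguished pairs
gen_test = work_estimate(d, fraction, gamma, set_size=3)   # full minimal triples

print("RUDR:             ", rudr)
print("generate-and-test:", gen_test)
print("all ordered triples (d^3):", d ** 3)
```

The ratio between the two trial counts is about 1/0.04 = 25, which is the constant-factor saving of RUDR for implicit matchings.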

7.2 Robust Regression 

RUDR can be applied to the problem of finding the least-median-of-squares 
(LMS) regression line. The most commonly considered problem is to fit a line 
to points in the plane. We apply RUDR to this problem by considering a series 
of distinguished points in the data. A single distinguished point is examined 
in each trial (since only two are required to define a line). For each trial, we 
determine the line that is optimal with respect to the median residual, but with 
the constraint that the line must pass through the distinguished point.



Fig. 1. Circle detection. (a) Engineering drawing. (b) Circles found comprising 4% of
the image. (c) Perceptually salient circles found comprising 0.8% of the image. (d)
Insalient circles found comprising 0.8% of the image.



It can be shown that the solution to this constrained problem has a median
residual that is no more than the sum of the optimal median residual and the
distance of the distinguished point from the optimal LMS regression line [?].
Now, at least half of the data points must lie no farther from the optimal
regression line than the optimal median residual (by definition). Each trial thus
has a probability of at least 0.5 of obtaining a solution with a residual no worse
than twice the optimal median residual. The use of randomization implies that
we need to perform only a constant number of trials to achieve a good solution
with high probability (approximately $-\log_2 \delta$ trials are necessary to achieve an
error rate of $\delta$).
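
The constrained subproblem and the trial count can be illustrated with a small sketch. It does not use the parametric search method of the paper: as a stand-in, it simply scans a fixed grid of line directions through the distinguished point and keeps the one with the smallest median residual. All function names and the grid size `n_angles` are assumptions made for the example.

```python
import math
import random
import statistics

def trials_for_error_rate(delta):
    """Number of trials, about -log2(delta), so that with probability at least
    1 - delta some distinguished point lies within the optimal median residual."""
    return math.ceil(-math.log2(delta))

def lms_line_through(points, pivot, n_angles=3600):
    """Best (median residual, angle) over lines through `pivot`, found by
    brute-force search over n_angles directions in [0, pi)."""
    px, py = pivot
    best = None
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        ux, uy = math.cos(theta), math.sin(theta)
        # Perpendicular distance of each point from the line through pivot.
        residuals = [abs((x - px) * uy - (y - py) * ux) for x, y in points]
        med = statistics.median(residuals)
        if best is None or med < best[0]:
            best = (med, theta)
    return best

def rudr_lms(points, pivots, n_angles=3600):
    """Run one constrained subproblem per distinguished point; keep the best.
    Returns (median residual, angle, distinguished point)."""
    return min(lms_line_through(points, p, n_angles) + (p,) for p in pivots)

# Usage: 20 inliers on y = 2x + 1 and 8 gross outliers.
inliers = [(float(x), 2.0 * x + 1.0) for x in range(20)]
outliers = [(3.0, 40.0), (7.0, -20.0), (12.0, 60.0), (15.0, -5.0),
            (18.0, 90.0), (1.0, -30.0), (9.0, 50.0), (5.0, -40.0)]
data = inliers + outliers
rng = random.Random(0)
pivots = rng.sample(data, trials_for_error_rate(0.001))
print(rudr_lms(data, pivots))
```

A grid search costs time proportional to the number of points times the number of angles per trial, so it is only meant to make the structure of the method concrete, not to match the complexity quoted for the specialized solver.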

Each subproblem (corresponding to a distinguished point) can be solved using
a specialized method based on parametric search techniques [?]. This allows
each subproblem to be solved exactly in $O(n \log^2 n)$ time or in $O(n \log n)$ time
for a fixed precision solution using numerical techniques. These techniques have
also been extended to problems in higher dimensional spaces.

Fig. 2. Robust regression examples. The solid lines are the RUDR LMS estimate. The
dashed lines are the PROGRESS LMS estimate. The dotted lines are the least-squares 
fit. 



The complexity of our method is superior to the best known exact algorithms
for this problem. The PROGRESS algorithm [23] is a commonly used approx-
imation algorithm for LMS regression that is based on the generate-and-test
paradigm. It requires $O(n)$ time. However, unlike our algorithm, this algorithm
yields no lower bounds (with high probability) on the quality of the solution
detected.

Figure 2 shows two examples where RUDR, PROGRESS, and least-squares
estimation were used to perform regression. In these examples, there were 400
inliers and 100 outliers, both from two-dimensional normal distributions. For
these experiments, 10 trials of the RUDR algorithm were considered, and 50
trials of the PROGRESS algorithm. For both cases, RUDR produces the best
fit to the inliers. The least-squares fit is known to be non-robust, so it is not
surprising that it fares poorly. The PROGRESS algorithm has difficulty, since,
even in 50 trials, it does not generate a solution very close to the optimal solution.

7.3 Object Recognition 

The application of RUDR to object recognition yields an algorithm with $O(md^k)$
computational complexity, where m is the number of model features, d is the 
number of data features, and k is the minimal number of feature matches neces- 
sary to constrain the position of the model up to a finite ambiguity in the case 
of errorless features in general position. 

Fig. 3. Three-dimensional object recognition. (a) Corners detected in the image. (b)
Best hypothesis found.

For recognizing three-dimensional objects using two-dimensional image data,
$k = 3$. In each subproblem, we compute the pose for each minimal matching
containing the distinguished matching using the method of Huttenlocher and
Ullman [?]. We then use a multi-dimensional histogramming technique that
examines each axis of the pose space separately. After finding the clusters along
some axis in the pose space, the clusters of sufficient size are then analyzed
recursively in the remainder of the pose space [?]. The poses for all sets of points
sharing a distinguished matching of cardinality $k - 1$ lie in a two-dimensional
subspace for this case. Despite this, we perform the histogramming in the full
six-dimensional space, since this requires little extra time and space with this
histogramming method. Feature error has been treated in an ad hoc manner in
this implementation through the examination of overlapping bins in the pose
space. Complex images may require a more thorough analysis of the errors.
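
The axis-by-axis histogramming described above can be sketched as a short recursion. This is an illustrative simplification under our own assumptions: poses are tuples of floats, each axis has a fixed bin width, and bins do not overlap (the implementation described in the text examines overlapping bins). The name `recursive_histogram` and its parameters are hypothetical.

```python
import math
from collections import defaultdict

def recursive_histogram(poses, bin_widths, min_size, axis=0):
    """Histogram the poses along one axis, then recurse on every bin that
    still holds at least `min_size` poses, one axis at a time.
    Returns a list of clusters (each a list of poses)."""
    if len(poses) < min_size:
        return []
    if axis == len(bin_widths):
        return [poses]                    # survived every axis: a cluster
    bins = defaultdict(list)
    for pose in poses:
        bins[math.floor(pose[axis] / bin_widths[axis])].append(pose)
    clusters = []
    for members in bins.values():
        clusters.extend(recursive_histogram(members, bin_widths, min_size, axis + 1))
    return clusters

# Usage: a tight group of four 3-D poses among scattered ones.
group = [(1.1, 2.1, 3.1), (1.2, 2.2, 3.2), (1.3, 2.3, 3.3), (1.4, 2.4, 3.4)]
scattered = [(1.5, 7.0, 0.2), (1.6, 2.5, 9.0), (5.0, 2.2, 3.3), (8.0, 8.0, 8.0)]
print(recursive_histogram(group + scattered, (1.0, 1.0, 1.0), min_size=3))
```

Only one-dimensional histograms are ever built, so the memory used does not grow with the product of the bin counts over all axes. This is why histogramming in the full pose space adds little cost.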

We can also apply these techniques to images in which imperfect grouping
techniques have determined sets of points that are likely to derive from the same
object [?]. This allows a reduction in both the computational complexity and
the rate of false positives. Figure 3 shows an example where this approach has
been applied to the recognition of a three-dimensional object.



7.4 Motion Segmentation 

RUDR can be used to perform motion segmentation with any technique for 
determining structure and motion from corresponding data features in multiple 
images. In this problem, we are given sets of data features in multiple images. We 
assume that we know the feature correspondences between images (e.g. from a 
tracking mechanism), but not which sets of features belong to coherent objects. 

Say that we have an algorithm to determine structure and motion using
k feature correspondences in i images and that there are d features for which
we know the correspondences between the images (see [?] for a review of such
techniques). We examine distinguished matchings of $k - 1$ sets of feature corre-
spondences between the images. Each subproblem is solved by determining the
hypothetical structure and motion of each minimal matching ($k$ sets of feature
correspondences) containing the distinguished matching and then determining
how many of the minimal matchings yield consistent structures for the distin-
guished matching and motions that are consistent with them belonging to a
single object. This is repeated for enough distinguished matchings to find all
of the rigidly moving objects consisting of some minimum fraction of all image
features.

Our analysis for implicit matchings implies that we must examine approxi-
mately $\epsilon^{1-k} \ln\frac{1}{\gamma}$ trials to find objects whose fraction of the total number of data
features is at least $\epsilon$ with a probability of failure for a particular object no larger
than $\gamma$.



8 Summary 

This paper has described a technique that we have named RUDR for solving 
model extraction and fitting problems such as recognition and regression. This 
approach is very general and can be applied to a wide variety of problems where
a model is fit to a set of data features, and it is tolerant of noisy data features,
occlusion, and outliers.

The RUDR method draws advantages from both the generate-and-test para- 
digm and from parameter space methods based on the Hough transform. The 
key ideas are: (1) Break down the problem into many small subproblems in 
which only the model positions consistent with some distinguished matching of 
features are examined. (2) Use randomization techniques to limit the number 
of subproblems that need to be examined to guarantee a low probability of 
failure. (3) Use clustering or parameter space analysis techniques to determine 
the matchings that satisfy the criteria. 

The use of this technique yields two primary advantages over previous meth- 
ods. First, RUDR is computationally efficient and has a low memory require- 
ment. Second, we can use methods by which the localization error in the data 
features is propagated precisely, so that false positives and false negatives do not 
occur.

Discussion

Tom Drummond: It seems to me that there’s an asymmetry between the 
noise-fitting of the points you choose for your $(k-1)$-D model and the noise
distribution in pose space. Can you comment on how you cope with this?

Clark Olson: The features of the distinguished matching (the $(k-1)$-D model)
do play a more important role in each trial than the remaining features. The error 
in each feature is treated the same way in each minimal matching that is exam- 
ined, but the features in the distinguished matching are seen in every minimal 
matching for a particular trial. The use of a distinguished matching constrains 
the pose to a sub-manifold of the pose space. Within this sub-manifold, the poses 
of the minimal matchings will be clustered in a smaller area, and the center will 
be shifted from the center of the set of all correct minimal matchings. However, 
with a precise method to process each trial, proposition 1 implies that the use 
of distinguished matchings has no effect on the overall result of the algorithm so 
long as we include at least one distinguished matching belonging to each model 
that should be reported. 

Andrew Fitzgibbon: Just an observation. As well as reducing the dimension- 
ality of the loci in the Hough space, in the circle case you also make them simpler 
— they’re cones with raw Hough, but just lines with your case. That makes them 
easier to cluster, etc. 

Clark Olson: Yes, that’s correct.

Abstract. A new approach to characterizing the performance of point-
correspondence algorithms is presented. Instead of relying on any
“ground truth”, it uses the self-consistency of the outputs of an algo-
rithm independently applied to different sets of views of a static scene.
It allows one to evaluate algorithms for a given class of scenes, as well as 
to estimate the accuracy of every element of the output of the algorithm 
for a given set of views. Experiments to demonstrate the usefulness of 
the methodology are presented. 



1 Introduction and Motivation 

One way of characterizing the performance of a stereo algorithm is to compare
its matches against “ground truth.” If sufficient quantities of accurate ground 
truth were available, estimating the distribution of errors over many image pairs 
of many scenes (within a class of scenes) would be relatively straightforward. 
This distribution could then be used to predict the accuracy of matches in new 
images. Unfortunately, acquiring ground truth for any scene is an expensive and 
problematic proposition at best. 

Instead, we propose to estimate a related distribution, which can be derived 
automatically from the matches of many image pairs of many scenes, assuming 
only that the projection matrices for the image pairs (and their covariances) 
have been correctly estimated, up to an unknown projective transformation. 

The related self-consistency distribution, as we call it, is the distribution of 
the normalized difference between triangulations of matches obtained when one 
image is fixed and the projection matrix of the second image is changed, averaged 
over all matches, many images, and many scenes. 

* This work was sponsored in part by the Defense Advanced Research Projects Agency 
under contract F33615-97-C-1023 monitored by Wright Laboratory. The views and 
conclusions contained in this document are those of the authors and should not be 
interpreted as representing the official policies, either expressed or implied, of the 
Defense Advanced Research Projects Agency, the United States Government, or SRI 
International.

Intuitively, a perfect stereo algorithm is one for which the triangulations are
invariant to changes in the second image, that is, one for which the mean and 
variance of the self-consistency distribution are zero. The extent to which the 
distribution deviates from this is a measure of the accuracy of the algorithm. Of 
course, a stereo algorithm that is perfectly self-consistent in this sense can still 
have systematic biases. We will discuss such biases in a more complete version 
of this paper. 

Although the self-consistency distribution is an important global characteri- 
zation of a stereo algorithm, it would be better if we could refine the predicted 
accuracy of individual matches given just a pair of images. We propose to do 
this by estimating the self-consistency distribution as a function of some type of 
“score” (such as sum of squared difference) that can be computed for each match 
using only the image pair. Conditionalizing the self-consistency distribution like 
this not only allows us to better predict the accuracy of individual matches, 
it also allows us to compare different scoring functions to see which one best 
correlates with the self-consistency of matches. 

The self-consistency distribution is a very simple idea that has powerful con- 
sequences. It can be used to compare algorithms, compare scoring functions, 
evaluate the performance of an algorithm across different classes of scenes, tune 
algorithm parameters, reliably detect changes in a scene, and so forth. All of this 
can be done for little manual cost beyond the precise estimation of the camera 
parameters and perhaps manual inspection of the output of the algorithm on a 
few images to identify systematic biases. 

In the remainder of this paper we will describe the algorithm we use for 
estimating the self-consistency distribution of general n-point correspondence 
algorithms given prior collections of images. This includes the development of a 
method to normalize the triangulation (or reprojection) differences due to the 
inherent errors arising from the nominal accuracy of the matches, the projec- 
tion matrices, and their covariances. Monte Carlo experiments are presented to 
justify the normalization method. We will then show how the self-consistency 
distribution can be used to compare stereo algorithms and scoring functions.

2 Previous Work in Estimating Uncertainty 

Existing work on estimating uncertainty without ground truth falls into two 
categories: analytical approaches and statistical approaches. 

The analytical approaches are based on the idea of error propagation [?].
When the output is obtained by optimizing a certain criterion (like a correlation
measure), the shape of the optimization curve [?] or surface [?] provides esti-
mates of the covariance through the second-order derivatives. These approaches
make it possible to compare the uncertainty of different outputs given by the 
same algorithm. However, it is problematic to use them to compare different 
algorithms.

Statistical approaches make it possible to compute the covariance given only
one data sample and a black-box version of an algorithm, by repeated runs of
the algorithm, and application of the law of large numbers.

Both of the above approaches characterize the performance of a given output 
only in terms of its expected variation with respect to additive white noise. 
In [15], the accuracy was characterized as a function of image resolution. The
bootstrap methodology [?] goes further, since it makes it possible to characterize
the accuracy of a given output with respect to IID noise of unknown distribution. 
Even if such an approach could be applied to the multiple image correspondence 
problem, it would characterize the performance with respect to IID sensor noise. 
Although this is useful for some applications, for other applications it is necessary 
to estimate the expected accuracy and reliability of the algorithms as viewpoint, 
scene domain, or other imaging conditions are varied. This is the problem we 
seek to address with the self-consistency methodology. 

3 The Self-Consistency Distribution 

To understand the self-consistency distribution, consider the following thought 
experiment. Consider a match $(m_A, m_B)$ derived from two images, A and B.
Now, fix image A and $m_A$, vary the projection matrix of the second image to
produce image $B'$, and apply the stereo algorithm to images A and $B'$, produc-
ing the match $(m_A, m_{B'})$. Because the coordinates of the matches are identical
in image A, the two matches should triangulate to the same point in space, 
within the expected error induced by the nominal precision of the matches, the 
projection matrices, and their covariances. 

The distribution of the difference between the triangulations of the two 
matches, after suitable normalization, averaged over all matches derived from 
many image pairs of many scenes, is what we call the self-consistency distribu- 
tion for that algorithm. We will discuss in detail the normalization later in the 
paper. 

When the triangulations are equal to within the expected error, the matches 
are said to be consistent. When an algorithm produces matches that are
always consistent in this sense, we say that the algorithm is self-consistent. 

Note that the self-consistency distribution is directly applicable to change 
detection by using the x% confidence interval for a match. The x% confidence 
interval is the largest normalized distance that two matches (with the same 
coordinate in one image) can have x% of the time. Thus, two matches derived 
from images of a scene taken at different times can then be compared against 
this confidence interval to see if the scene has changed over time at that point 
(see [?]).
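
As an illustration of this use, the sketch below derives the x% threshold from an empirical sample of normalized distances and applies it to a new pair of matches. It assumes that the self-consistency distribution is available as a plain list of normalized distances and uses the nearest-rank percentile. Both function names are ours.

```python
import math

def confidence_threshold(normalized_distances, x_percent):
    """Largest normalized distance reached x% of the time
    (nearest-rank percentile of the empirical sample)."""
    ordered = sorted(normalized_distances)
    rank = max(1, math.ceil(x_percent * len(ordered) / 100.0))
    return ordered[rank - 1]

def has_changed(distance, threshold):
    """Flag a change when two matches taken at different times disagree by
    more than the self-consistency distribution allows."""
    return distance > threshold

# Usage with a toy sample of 100 normalized distances.
sample = [float(i) for i in range(1, 101)]
threshold = confidence_threshold(sample, 95)
print(threshold, has_changed(120.0, threshold), has_changed(3.0, threshold))
```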

3.1 A Methodology for Estimating the Self-Consistency 
Distribution 

Ideally, the self-consistency distribution should be computed using all possible 
variations of viewpoint and camera parameters (within some class of variations)
over all possible scenes (within some class of scenes). However, we can compute
an estimate of the distribution using some small number of images of a scene, 
and average this distribution over many scenes. 

In the thought experiment above, we first found a match, fixed the coordinate 
of the match in one image, varied the camera parameters of the second image to
get a second match, and then computed the normalized distance between their 
triangulations. 

Here we start with some fixed collection of images assumed to have been 
taken at exactly the same time (or, equivalently, a collection of images of a static 
scene taken over time). Each image has a unique index and associated projection 
matrix and (optionally) projection covariances. We then apply a stereo algorithm 
independently to all pairs of images in this collection¹. The image indices, match
coordinates, and score are reported in match files for each image pair.

We now search the match files for pairs of matches that have the same co- 
ordinate in one image. For example, if a match is derived from images 1 and 2, 
another match is derived from images 1 and 3, and these two matches have the 
same coordinate in image 1, then these two matches correspond to one instance 
of the thought experiment. Such a pair of matches, which we call a common-point 
match set, should be self-consistent because they should correspond to the same 
point in the world. This extends the principle of the trinocular stereo constraint 
[12] to arbitrary camera configurations and multiple images.

Given two matches in a common-point match set, we can now compute the 
distance between their triangulations, after normalizing for the camera con- 
figurations. The histogram of these normalized differences, computed over all 
common-point matches, is our estimate of the self-consistency distribution. 
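
The search for common-point match sets can be sketched as a grouping step followed by pairing. The record layout `(image_a, coord_a, image_b, coord_b)`, the requirement that shared coordinates compare exactly equal, and the callables `triangulate` and `normalized_distance` are all assumptions of this example. The last two stand for the triangulation and normalization steps, which are not implemented here.

```python
from collections import defaultdict
from itertools import combinations

def common_point_pairs(matches):
    """Group matches by (image index, coordinate) and return every pair of
    matches that shares a coordinate in one image."""
    by_point = defaultdict(list)
    for match in matches:
        image_a, coord_a, image_b, coord_b = match
        by_point[(image_a, coord_a)].append(match)
        by_point[(image_b, coord_b)].append(match)
    pairs = []
    for shared in by_point.values():
        pairs.extend(combinations(shared, 2))
    return pairs

def self_consistency_samples(matches, triangulate, normalized_distance):
    """One normalized difference per pair in a common-point match set; a
    histogram of these values estimates the self-consistency distribution."""
    return [normalized_distance(triangulate(m1), triangulate(m2))
            for m1, m2 in common_point_pairs(matches)]

# Usage: two matches share the point (10, 20) in image 1.
matches = [
    (1, (10, 20), 2, (12, 20)),
    (1, (10, 20), 3, (15, 21)),
    (2, (40, 40), 3, (44, 41)),
]
print(common_point_pairs(matches))
```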

Another distribution that one could compute using the same data files would 
involve using all the matches in a common-point match set, rather than just 
pairs of matches. For example, one might use the deviation of the triangulations 
from the mean of all triangulations within a set. This is problematic for several 
reasons. 

First, there are often outliers within a set, making the mean triangulation 
less than useful. One might mitigate this by using a robust estimation of the 
mean. But this depends on various (more or less) arbitrary parameters of the 
robust estimator that could change the overall distribution. 

Second, and perhaps more importantly, we see no way to extend the normal-
ization used to eliminate the dependence on camera configurations, described
later in the paper, to the case of multiple matches.

Third, we see no way of using the above variants of the self-consistency 
distribution for change detection. 



¹ Note that the “stereo” algorithm can find matches in n > 2 images. In this case, the
algorithm would be applied to all subsets of size n. We use n = 2 to simplify the
presentation here.

[Figure 1 panels: (a) sample image, (b) self-consistency distributions, (c) scatter diagrams with MDL score]

Fig. 1. Results on two different types of images: terrain (top) vs. tree canopy (bottom).



3.2 An Example of the Self-Consistency Distribution 

To illustrate the self-consistency distribution, we first apply the above method-
ology to the output of a simple stereo algorithm [?]. The algorithm first rectifies
the input pair of images and then searches for 7×7 windows along scan lines that
maximize a normalized cross-correlation metric. Sub-pixel accuracy is achieved 
by fitting a quadratic to the metric evaluated at the pixel and its two adjacent 
neighbours. The algorithm computes matches first by comparing the left image
against the right and then by comparing the right image against the left. Matches
that are not consistent between the two searches are eliminated.
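
The sub-pixel step can be illustrated with the standard three-point parabola fit. The sketch assumes that the correlation score is to be maximized and that the three inputs are the scores at the best integer position and at its left and right neighbours. The function name is ours.

```python
def subpixel_offset(score_left, score_center, score_right):
    """Offset in (-0.5, 0.5) of the vertex of the parabola through the scores
    at positions -1, 0 and +1; 0.0 if the center is not a strict maximum."""
    denom = score_left - 2.0 * score_center + score_right
    if denom >= 0.0:
        return 0.0          # flat or convex: no interior maximum to refine
    return 0.5 * (score_left - score_right) / denom

# Usage: samples of -(x - 0.3)^2 at x = -1, 0, 1 peak at offset 0.3.
print(subpixel_offset(-1.69, -0.09, -0.49))
```

The refined disparity is the integer disparity plus this offset.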

The stereo algorithm was applied to all pairs of five aerial images of bare 
terrain, one of which is illustrated in the top row of Fig. 1(a). These images
are actually small windows from much larger images (about 9000 pixels on a 
side) for which precise ground control and bundle adjustment were applied to 
get accurate camera parameters. 

Because the scene consists of bare, relatively smooth, terrain with little veg- 
etation, we would expect the stereo algorithm described above to perform well. 
This expectation is confirmed anecdotally by visually inspecting the matches. 

However, we can get a quantitative estimate for the accuracy of the algorithm 
for this scene by computing the self-consistency distribution of the output of the 
algorithm applied to the ten image pairs in this collection. Figure 1(b) shows 
two versions of the distribution. The solid curve is the probability density (the 
probability that the normalized distance equals x). It is useful for seeing the mode 
and the general shape of the distribution. The dashed curve is the cumulative 
probability distribution (the probability that the normalized distance is less than 
x). It is useful for seeing the median of the distribution (the point where the curve 
reaches 0.5) or the fraction of match pairs with normalized distances exceeding 
some value. 
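
As an illustration of how the two curves are read, the sketch below computes an empirical density and cumulative fraction from a list of normalized distances. The distances are made-up values and the helper names are hypothetical; only the definitions of the curves come from the text.

```python
import statistics


def cumulative_fraction(distances, x):
    """Empirical cumulative distribution: fraction of distances below x."""
    return sum(1 for d in distances if d < x) / len(distances)


def histogram_density(distances, bin_width):
    """Piecewise-constant density estimate, returned as {bin_index: density}.

    Bin k covers the interval [k * bin_width, (k + 1) * bin_width).
    """
    counts = {}
    for d in distances:
        k = int(d // bin_width)
        counts[k] = counts.get(k, 0) + 1
    n = len(distances)
    return {k: c / (n * bin_width) for k, c in sorted(counts.items())}


# Hypothetical normalized distances for a handful of match pairs.
distances = [0.2, 0.4, 0.6, 0.8, 1.5, 12.0]
density = histogram_density(distances, bin_width=0.5)
median = statistics.median(distances)                   # cumulative curve reaches 0.5
below_one = cumulative_fraction(distances, 1.0)         # fraction with distance < 1
above_ten = 1.0 - cumulative_fraction(distances, 10.0)  # fraction with distance >= 10
```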

In this example, the self-consistency distribution shows that the mode is 
about 0.5, about 95% of the normalized distances are below 1, and that about 
2% of the match pairs have normalized distances above 10. 

In the bottom row of Fig. 1, we see the self-consistency distribution for the 
same algorithm applied to all pairs of five aerial images of a tree canopy. Such 
scenes are notoriously difficult for stereo algorithms. Visual inspection of the 
output of the stereo algorithm confirms that most matches are quite wrong. 
This can be quantified using the self-consistency distribution in Fig. 1(b). Here 
we see that, although the mode of the distribution is still about 0.5, only 10% of 
the matches have a normalized distance less than 1, and only 42% of the matches 
have a normalized distance less than 10. 

Note that the distributions illustrated above are not well modelled by 
Gaussian distributions because of the predominance of outliers (especially in the 
tree canopy example). This is why we have chosen to compute the full 
distribution rather than use its variance as a summary. 

3.3 Conditionalization 

As mentioned in the introduction, the global self-consistency distribution, while 
useful, is only a weak estimate of the accuracy of the algorithm. This is clear 
from the above examples, in which the unconditional self-consistency distribution 
varied considerably from one scene to the next. 

However, we can compute the self-consistency distribution for matches having 
a given “score” (such as the MDL-based score described in detail below). This 
is illustrated in Fig. 1(c) using a scatter diagram. The scatter diagram shows a 
point for every pair of matches, the x coordinate of the point being the larger 
of the scores of the two matches, and the y coordinate being the normalized 
distance between the matches. 

There are several points to note about the scatter diagrams. First, the terrain 
example (top row) shows that most points with scores below 0 have normalized 
distances less than about 1. Second, most of the points in the tree canopy ex- 
ample (bottom row) are not self-consistent. Third, none of the points in the tree 
canopy example have scores below 0. Thus, it would seem that this score is able 
to segregate self-consistent matches from non-self-consistent matches, even when 
the scenes are radically different (see Sect. 5.3). 

4 Projection Normalization 

To apply the self-consistency method to a set of images, all we need is the set 
of projection matrices in a common projective coordinate system. This can be 
obtained from point correspondences using projective bundle adjustment fTRlMj 
and does not require camera calibration. The Euclidean distance is not invariant 
to the choice of projective coordinates, but this dependence can often be reduced 
by using the normalization described below. Another way to do so, which actually 
cancels the dependence on the choice of projective coordinates, is to compute the 
difference between the reprojections instead of the triangulations, as described 
in more detail in El. This, however, does not cancel the dependence on the 
relative geometry of the cameras. 

4.1 The Mahalanobis Distance 

Assuming that the contribution of each individual match to the statistics is the 
same ignores many imaging factors like the geometric configuration of the cam- 
eras and their resolution, or the distance of the 3D point from the cameras. There 
is a simple way to take into account all of these factors: applying a 
normalization which makes the statistics invariant to these imaging factors. In addition, 
this mechanism makes it possible to take into account the uncertainty in camera 
parameters, by including them in the observation parameters. 

We assume that the observation error (due to image noise and digitization 
effects) is Gaussian. This makes it possible to compute the covariance of the 
reconstruction given the covariance of the observations. Let us consider two 
reconstructed estimates of a 3-D point, $M_1$ and $M_2$, to be compared, and their 
computed covariance matrices $\Lambda_1$ and $\Lambda_2$. We weight the squared Euclidean 
distance between $M_1$ and $M_2$ by the sum of their covariances. This yields the 
squared Mahalanobis distance: $(M_1 - M_2)^T (\Lambda_1 + \Lambda_2)^{-1} (M_1 - M_2)$. 
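
A minimal sketch of this normalized distance in plain Python is given below. The small linear solver and the function names are illustrative assumptions; any linear algebra library could be substituted.

```python
def solve(a, b):
    """Solve a x = b for a small square system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def squared_mahalanobis(m1, m2, cov1, cov2):
    """(m1 - m2)^T (cov1 + cov2)^-1 (m1 - m2) for two 3-D estimates."""
    diff = [p - q for p, q in zip(m1, m2)]
    total = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(cov1, cov2)]
    y = solve(total, diff)  # y = (cov1 + cov2)^-1 diff, without forming the inverse
    return sum(d * v for d, v in zip(diff, y))
```

With identity covariances for both estimates, the result is half the squared Euclidean distance, which makes a convenient sanity check.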

4.2 Determining the Reconstruction and Reprojection Covariances 

If the measurements are modeled by the random vector $x$, of mean $x_0$ and of 
covariance $\Lambda_x$, then the vector $y = f(x)$ is a random vector of mean $f(x_0)$ 
and, up to the first order, covariance $J_f(x_0) \Lambda_x J_f(x_0)^T$, where $J_f(x_0)$ is the 
Jacobian matrix of $f$ at the point $x_0$. 
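
The first-order propagation rule can be sketched numerically as follows. The central-difference Jacobian and the function names are assumptions made for illustration; the text itself relies on analytic derivatives.

```python
def jacobian(f, x0, eps=1e-6):
    """Central-difference Jacobian of f at x0, returned as a list of rows."""
    n = len(x0)
    cols = []
    for k in range(n):
        hi = list(x0)
        lo = list(x0)
        hi[k] += eps
        lo[k] -= eps
        cols.append([(a - b) / (2.0 * eps) for a, b in zip(f(hi), f(lo))])
    n_out = len(cols[0])
    return [[cols[k][i] for k in range(n)] for i in range(n_out)]


def propagate_covariance(f, x0, cov_x):
    """First-order covariance of y = f(x): J cov_x J^T, with J evaluated at x0."""
    j = jacobian(f, x0)
    n = len(x0)
    # jc = J cov_x
    jc = [[sum(row[k] * cov_x[k][l] for k in range(n)) for l in range(n)] for row in j]
    # result = (J cov_x) J^T
    return [[sum(r[l] * s[l] for l in range(n)) for s in j] for r in jc]
```

For a linear map the result is exact up to rounding, which mirrors the remark that the approximation is exact only in the linear (affine) case.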

In order to determine the 3-D distribution error in reconstruction, the vector 
$x$ is defined by concatenating the 2-D coordinates of each point of the match, i.e. 
$[x_1, y_1, x_2, y_2, \ldots, x_n, y_n]$, and the result of the function is the 3-D coordinates 
$X, Y, Z$ of the point $M$ reconstructed from the match, in the least-squares 
sense. The key is that $M$ is expressed by a closed-form formula of the form 
$M = (L^T L)^{-1} L^T b$, where $L$ and $b$ are a matrix and vector which depend on 
the projection matrices and coordinates of the points in the match. This makes 
it possible to obtain the derivatives of $M$ with respect to the $2n$ measurements 
$w_i$, $i = 1, \ldots, n$, $w = x, y$. We also assume that the errors at each pixel are 
independent, uniform, and isotropic. The covariance matrix $\Lambda_x$ is then diagonal, 
therefore each element of $\Lambda_M$ can be computed as a sum of independent terms 
for each image. 
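
One common way to build $L$ and $b$, shown here as a hedged sketch rather than the authors' exact construction, is to turn each image measurement into two linear equations in the unknown point and solve the normal equations. The function names and the use of Cramer's rule for the 3x3 system are choices made for this example.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def project(p, m):
    """Projection of the 3-D point m by the 3x4 matrix p."""
    h = [sum(row[k] * m[k] for k in range(3)) + row[3] for row in p]
    return h[0] / h[2], h[1] / h[2]


def triangulate(projections, points):
    """Least-squares point M = (L^T L)^-1 L^T b from n >= 2 image measurements."""
    rows, rhs = [], []
    for p, (x, y) in zip(projections, points):
        # x = (p1 . M~) / (p3 . M~) rearranged into a linear equation in M,
        # and likewise for y with the second row of p.
        for coord, r in ((x, p[0]), (y, p[1])):
            rows.append([r[k] - coord * p[2][k] for k in range(3)])
            rhs.append(coord * p[2][3] - r[3])
    ltl = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    ltb = [sum(row[i] * v for row, v in zip(rows, rhs)) for i in range(3)]
    d = det3(ltl)
    result = []
    for k in range(3):  # Cramer's rule on the 3x3 normal equations
        mk = [[ltb[i] if j == k else ltl[i][j] for j in range(3)] for i in range(3)]
        result.append(det3(mk) / d)
    return result
```

Because `triangulate` is a closed-form function of the measurements, its derivatives with respect to each $x_i$, $y_i$ can be obtained analytically or by finite differences and fed into the covariance propagation rule.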

The above calculations are exact when the mapping between the vector of 
coordinates of the $m_i$ and $M$ (resp. the $m'_j$ and $M'$) is linear, since it is only in that 
case that the distribution of $M$ and $M'$ is Gaussian. The reconstruction 
operation is exactly linear only when the projection matrices are affine. However, 
the linear approximation is expected to remain reasonable under normal view- 
ing conditions, and to break down only when the projection matrices are in 
configurations with strong perspective. 

5 Experiments 

5.1 Synthetic Data 

In order to gain insight into the nature of the normalized self-consistency distri- 
butions, we investigate the case when the noise in point localization is Gaussian. 

We first derive the analytical model for the self-consistency distribution in 
that case. We then show, using Monte Carlo experiments, that, provided that 
the geometrical normalization described in Sect. 4 is used, the experimental 
self-consistency distributions fit this model quite well when perspective effects are 
not strong. A consequence of this result is that, under the hypothesis that the 
localization error of the features in the images is Gaussian, the difference 
self-consistency distribution could be used to recover exactly the accuracy 
distribution. 

Modeling the Gaussian Self-Consistency Distributions. The squared Mahalanobis 
distance in 3D follows a chi-square distribution with three degrees of freedom: 

$$\chi^2_3(x) = \frac{\sqrt{x}}{\sqrt{2\pi}}\, e^{-x/2}.$$ 

In our model, the Mahalanobis distance is computed between $M$ and $M'$, 
reconstructions in 3D, which are obtained from matches $m_i$, $m'_j$ whose coordinates 
are assumed to be Gaussian, zero-mean and with standard deviation $\sigma$. If $M$, 
$M'$ are obtained from the coordinates $m_i$, $m'_j$ with linear transformations $A$, 
$A'$, then the covariances are $\sigma^2 A A^T$, $\sigma^2 A' A'^T$. The Mahalanobis distance follows 
the distribution: 

$$d_3(x) = \sqrt{\frac{2}{\pi}}\, \frac{x^2}{\sigma^3}\, e^{-x^2 / (2\sigma^2)}. \qquad (1)$$ 

Using the Mahalanobis distance, the self-consistency distributions should be 
statistically independent of the 3D points and projection matrices. Of course, if 
we were just using the Euclidean distance, there would be no reason to expect 
such an independence. 
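
As a worked illustration of how the noise level enters such a model, consider the following derivation. It rests on two assumptions made here rather than taken from the text: the two reconstructions are independent, and the distance is normalized with the unit-variance covariance $\Lambda = A A^T + A' A'^T$, so that the unknown $\sigma$ is not divided out.

```latex
\begin{align*}
M - M' &\sim N\!\left(0,\ \sigma^2 \Lambda\right), \qquad \Lambda = A A^T + A' A'^T, \\
d^2 &= (M - M')^T \Lambda^{-1} (M - M') = \sigma^2 u, \qquad u \sim \chi^2_3, \\
p_d(x) &= \sqrt{\frac{2}{\pi}}\, \frac{x^2}{\sigma^3}\, e^{-x^2 / (2\sigma^2)}, \qquad x \ge 0 .
\end{align*}
```

Under these assumptions the density of the distance peaks at $x = \sigma\sqrt{2}$ and depends only on $\sigma$, not on the 3D points or the projection matrices, which is the kind of invariance the normalization is meant to provide.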

Comparison of the Normalized and Unnormalized Distributions. To explore the 
domain of validity of the first-order approximation to the covariance, we have 
considered three methods to generate random projection matrices: 

1. General projection matrices are picked randomly. 

2. Projection matrices are obtained by perturbing a fixed, realistic matrix 
(which is close to affine). Entries of this matrix are each varied randomly 
within 500% of the initial value. 
3. Affine projection matrices are picked randomly. 

Each experiment in a set consisted of picking random 3D points, random 
projection matrices according to the configuration previously described, pro- 
jecting them, adding random Gaussian noise to the matches, and computing the 
self-consistency distributions by labelling the matches so that they are perfect. 

To illustrate the invariance of the distribution that we can obtain using the 
normalization, we performed experiments where we computed both the normalized 
version and the unnormalized version of the self-consistency. As can be 
seen in Fig. 2, using the normalization dramatically reduced the spread of the 
self-consistency curves found within each experiment in a set. In particular, in 
the last two configurations, the resulting spread was very small, which indicates 
that the geometrical normalization was successful at achieving invariance with 
respect to 3D points and projection matrices. 




[Figure 2 panels, left to right: random general projections, perturbed projections, random affine projections] 
Fig. 2. Un-normalized (top) vs normalized (bottom) self-consistency distributions. 



Comparison of the Experimental and Theoretical Distributions. Using the 
Mahalanobis distance, we then averaged the density curves within each set of 
experiments, and tried to fit the model described in Eq. 1 to the resulting curves, 
for six different values of the standard deviation, $\sigma$ = 0.5, 1, 1.5, 2, 2.5, 3. As 
illustrated in Fig. 3, the model describes the average self-consistency curves very well 
when the projection matrices are affine (as expected from the theory), but also 
when they are obtained by perturbation of a fixed matrix. When the projection 
matrices are picked totally at random, the model does not describe the curves 
very well, but the different self-consistency curves corresponding to each noise 
level are still distinguishable. 
[Figure 3 panels, left to right: perturbed projections, random affine projections] 

Fig. 3. Averaged theoretical (solid) and experimental (dashed) curves. 



5.2 Comparing Two Algorithms 

The experiments described here and in the following section are based on the 
application of stereo algorithms to seventeen scenes, each comprising five images, 
for a total of 85 images and 170 image pairs. At the highest resolution, each image 
is a window of about 900 pixels on a side from images of about 9000 pixels on 
a side. Some of the experiments were done on Gaussian-reduced versions of the 
images. These images were controlled and bundle-adjusted to provide accurate 
camera parameters. 

A single self-consistency distribution for each algorithm was created by merg- 
ing the scatter data for that algorithm across all seventeen scenes. In previous 
papers, nmni, we compared two algorithms, but using data from only four im- 
ages. By merging the scatter data as we do here, we are now able to compare 
algorithms using data from many scenes. This results in a much more compre- 
hensive comparison. 

The merged distributions are shown in Fig. 4 as probability density 
functions for the two algorithms. The solid curve represents the distribution for our 
deformable mesh algorithm |3 , and the dashed curve represents the distribution 
for the stereo algorithm described above. 




Fig. 4. Comparing two stereo algorithms (Mesh vs Stereo) using the self-consistency 
distributions. 
Comparing these two graphs shows some interesting differences between 
the two algorithms. The deformable mesh algorithm clearly has more outliers 
(matches with normalized distances above 1), but has a much greater proportion 
of matches with distances below 0.25. This is not unexpected since the strength 
of deformable meshes is their ability to do very precise matching between 
images. However, the algorithm can get stuck in local minima. Self-consistency now 
allows us to quantify how often this happens. 

But this comparison also illustrates that one must be very careful when 
comparing algorithms or assessing the accuracy of a given algorithm. The dis- 
tributions we get are very much dependent on the scenes being used (as would 
also be the case if we were comparing the algorithms against ground truth — the 
“gold standard” for assessing the accuracy of a stereo algorithm). In general, 
the distributions will be most useful if they are derived from a well-defined class 
of scenes. It might also be necessary to restrict the imaging conditions (such as 
resolution or lighting) as well, depending on the algorithm. Only then can the 
distribution be used to predict the accuracy of the algorithm when applied to 
images of similar scenes. 



5.3 Comparing Three Scoring Functions 

To eliminate the dependency on scene content, we propose to use a score 
associated with each match. We saw scatter diagrams in Fig. 1(c) that illustrated 
how a scoring function might be used to segregate matches according to their 
expected self-consistency. 

In this section we will compare three scoring functions, one based on 
Minimum Description Length theory (the MDL score, Appendix A), the traditional 
sum-of-squared-differences (SSD) score, and the SSD score normalized by the 
localization covariance (SSD/GRAD score) |S|. All scores were computed using 
the same matches computed by our deformable mesh algorithm applied to all 
image pairs of the seventeen scenes mentioned above. The scatter diagrams for 
all of the areas were then merged together to produce the scatter diagrams shown 
in Fig. 5. 




[Figure 5 panels, left to right: MDL, SSD/GRAD, SSD] 

Fig. 5. Scatter diagrams for three different scores. 
The MDL score has the very nice property that the confidence interval (as 
defined earlier) rises monotonically with the score, at least until there is a paucity 
of data, when the score is greater than 2. It also has a broad range of scores 
(those below zero) for which the normalized distances are below 1, with fewer 
outliers than the other scores. 

The SSD / GRAD score also increases monotonically (with perhaps a shallow 
dip for small values of the score), but only over a small range. 

The traditional SSD score, on the other hand, is distinctly not monotonic. It 
is fairly non-self-consistent for small scores, then becomes more self-consistent, 
and then rises again. 

Another way that we can compare the scores is with a measure we call the 
efficiency of the scoring function. This is the number of match pairs for which 
the confidence interval is below some value d divided by the total number of 
match pairs having normalized distances less than d. Intuitively, the efficiency 
represents how well the scoring function can predict that match pairs will have 
normalized differences below some value given just the score. An ideal score 
would have an efficiency of 1 for all values of d. 
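
The efficiency measure can be sketched as below. The definition of the confidence interval lies outside this section, so the code assumes (purely for illustration) that it is the 99th percentile of the normalized distances of all match pairs whose score falls in the same score bin; the bin width and function names are likewise hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Smallest value v such that at least a fraction q of `values` is <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def efficiency(pairs, d, bin_width=0.5, q=0.99):
    """Efficiency at distance d for a list of (score, normalized_distance) pairs."""
    bins = defaultdict(list)
    for score, dist in pairs:
        bins[math.floor(score / bin_width)].append(dist)
    # Pairs whose score alone predicts a distance below d (assumed definition).
    predicted = sum(len(v) for v in bins.values() if percentile(v, q) < d)
    # Pairs whose distance really is below d.
    actual = sum(1 for _, dist in pairs if dist < d)
    return predicted / actual if actual else 0.0
```

A score that isolates all self-consistent pairs in bins of their own yields an efficiency of 1; a single outlier sharing a bin with good matches removes that whole bin from the numerator.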

The 99% efficiency of all three scores is illustrated in Fig. 6. Note that, overall, 
the MDL score is somewhat more efficient than the SSD/GRAD score, both of 
which are significantly more efficient than the SSD score. 




Fig. 6. Comparing three scoring schemes (MDL vs. SSD/GRAD vs. SSD) using the 
efficiency measure. 



Our previous publication HU compared two scoring functions by comparing 
their cumulative distributions in two different scenes. Here we compare scores 
using the merged data from many scenes using confidence-interval and efficiency 
graphs, again providing a more comprehensive comparison than was possible 
before. 

6 Conclusion and Perspectives 

We have proposed the self-consistency methodology as a means of estimating the 
accuracy and reliability of point-correspondence algorithms, comparing 
different algorithms, and comparing different scoring functions. We have 
presented a detailed prescription for applying this methodology to multiple-image 
point-correspondence algorithms, without any need for ground truth or camera 
calibration, and have demonstrated its utility in two experiments. 

The self-consistency distribution is a very simple idea that has powerful con- 
sequences. It can be used to compare algorithms, compare scoring functions, 
evaluate the performance of an algorithm across different classes of scenes, tune 
algorithm parameters, reliably detect changes in a scene, and so forth. All of this 
can be done for little manual cost beyond the precise estimation of the camera 
parameters and perhaps manual inspection of the output of the algorithm on a 
few images to identify systematic biases. 

Readers of this paper are invited to visit the self-consistency web site at 
http://www.ai.sri.com/sct/ to download an executable version of the code 
described in this paper, together with documentation and examples. 

Finally, we believe that the core idea of our methodology, which examines 
the self-consistency of an algorithm across independent experimental trials, can 
be used to assess the accuracy and reliability of algorithms dealing with a range 
of computer vision problems. This could lead to algorithms that can learn to be 
self-consistent over a wide range of scenes without 
the need for external training data or “ground truth.” 

A The MDL Score 

Given $N$ images, let $M$ be the number of pixels in the correlation window and 
let $g_i^j$ be the image gray level of the $i$th pixel observed in image $j$. For image $j$, 
the number of bits required to describe these gray levels as IID white noise can 
be approximated by: 

$$C_j = M (\log \sigma_j + c), \qquad (2)$$ 

where $\sigma_j$ is the measured variance of the $g_i^j$ and $c = (1/2) \log(2\pi e)$. 

Alternatively, these gray levels can be expressed in terms of the mean gray 
level $\bar{g}_i$ across images and the deviations $g_i^j - \bar{g}_i$ from this average in each 
individual image. The cost of describing the means can be approximated by 

$$\bar{C} = M (\log \bar{\sigma} + c), \qquad (3)$$ 

where $\bar{\sigma}$ is the measured variance of the mean gray levels. Similarly, the coding 
length of describing deviations from the mean is given by 

$$C_j^d = M (\log \sigma_j^d + c), \qquad (4)$$ 

where $\sigma_j^d$ is the measured variance of those deviations in image $j$. Note that, 
because we describe the mean across the images, we need only describe $N - 1$ 
of the $C_j^d$. The description of the $N$th one is implicit. 

The MDL score is the difference between these two coding lengths, normalized 
by the number of samples, that is 

$$\mathrm{Loss} = \bar{C} + \sum_{1 \le j \le N-1} C_j^d - \sum_{1 \le j \le N} C_j. \qquad (5)$$ 
When there is a good match between images, the $g_i^j$, $1 \le j \le N$, have a small variance. 
Consequently the $C_j^d$ should be small, $\bar{C}$ should be approximately equal to any of 
the $C_j$ and Loss should be negative. However, Loss can only be strongly negative 
if these costs are large enough, that is, if there is enough texture for a reliable 
match. See |0| for more details. 
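
The following sketch evaluates a score of this kind on small gray-level windows. It is an interpretation, not the authors' code: it assumes the loss is the cost of coding the mean window plus $N - 1$ deviation windows, minus the cost of coding each window independently, and it adds a small floor `eps` on the variance (an assumption) so that the logarithm stays finite for constant windows.

```python
import math

C = 0.5 * math.log(2.0 * math.pi * math.e)


def variance(values):
    """Population variance of a list of gray levels."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def coding_cost(values, eps=1e-6):
    """M (log variance + c); eps is an assumed floor that keeps the log finite."""
    return len(values) * (math.log(max(variance(values), eps)) + C)


def mdl_loss(windows):
    """windows[j][i] is the gray level of pixel i of the window in image j."""
    n = len(windows)
    means = [sum(col) / n for col in zip(*windows)]
    independent = sum(coding_cost(w) for w in windows)
    joint = coding_cost(means)
    for w in windows[: n - 1]:  # the N-th set of deviations is implied by the others
        joint += coding_cost([g - m for g, m in zip(w, means)])
    return joint - independent


# Hypothetical 9-pixel windows: a textured patch, a near copy, and an unrelated flat patch.
base = [10.0, 50.0, 30.0, 80.0, 20.0, 60.0, 40.0, 90.0, 70.0]
near_copy = [g + (0.5 if i % 2 == 0 else -0.5) for i, g in enumerate(base)]
flat = [50.0, 51.0, 49.0, 50.0, 52.0, 48.0, 50.0, 51.0, 49.0]
loss_good = mdl_loss([base, near_copy])
loss_bad = mdl_loss([base, flat])
```

In this toy case the well-matched, textured pair gives a strongly negative loss, while pairing the textured window with an unrelated low-contrast one gives a positive loss.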
Discussion 

Andrew Fitzgibbon: The metric you propose sounds a little like a quantifica- 
tion of Jacobs work on the generic view assumption, where solutions to shape 
from shading algorithms are consistent if, when you move the camera a little bit, 
they don’t change much. Are you aware of that, and have you looked at it? 
Yvan Leclerc: I’m well aware of the work. What they do is ask as a thought 
experiment, if I was to change my viewpoint a little bit, how would the output 
of the algorithm change. That’s the generic viewpoint constraint. Here what I 
do is look at how the real algorithm behaves under real changes in viewpoint 
provided by real new imagery. Also, as I understand the generic viewpoint thing, 
you look at local perturbations in viewpoint or viewing conditions, see how that 
affects the output, and pick the output that is most generic — that gives you 
the smallest change. The philosophy is a little different from what we are trying 
to do here. There’s a relationship between them, but I’m not sure that I can go 
further than that right now. 

Joe Mundy: The stereo case and the 3D model you reconstruct is pretty well 
defined and constrained. What if there was a large space of possible solutions, 
all of which could explain the data? How would you approach that? 

Yvan Leclerc: Well, I’m not sure exactly what you mean by that, but the 
methodology as I’ve described it should still let you pick out the sets of hy- 
potheses that are consistent. I certainly agree that for any single image or image 
pair there may be many hypotheses that can explain them. But as you get more 
and more observations, more and more views, the set of hypotheses that are 
consistent with the data should shrink until you get down to a core set that are 
consistent with all the images. That would be my expectation. 

Rick Szeliski: Perhaps this is a follow-on from Joe’s question. In stereo match- 
ing when you have textureless regions there are a lot of equivalent hypotheses, 
each as good as the others. So depth predictions from independent groups of 
measurements are unlikely to agree with each other. Is there a mechanism where 
the algorithms could say, this is my estimate but I have very low confidence in 
it, or perhaps even have a confidence interval. 

Yvan Leclerc: Exactly, that’s what the score is for. For each pair of matches, 
the algorithm supplies a score. Here I used the MDL score, for which scores near 
zero usually correspond to textureless regions. So you can tell which matches are 
in textureless regions, and which have large outliers. If you want, you can just 
keep the matches with good scores. So you have a fairly general algorithm that 
is able to separate good matches from bad ones. What is nice here is that you 
can guess “Oh yes, the MDL score ought to do that” and then actually verify 
that it does with the methodology. 

Bill Triggs: Given your experience with testing stereo on these difficult scenes 
with dirt ground and tree cover and things like that, which correlation or pixel 
correspondence method works the best? 
Yvan Leclerc: So far I’ve only compared the two examples that I gave, a 
simple stereo algorithm and the deformable patches. As I showed, the deformable 
patches tend to give a much more accurate solution in some circumstances where 
the score is negative. When the surface is somewhat smooth and it isn’t like a tree 
canopy, they are much better than traditional stereo. What I’d like to do is to 
have people try their stereo algorithms on sets of images and use self-consistency 
to characterize the results. Then we’d have a nice quantitative measure to com- 
pare different algorithms under different conditions, for various classes of images. 
That’s why I’m providing the software, so that people can download it and try 
it for themselves. (Provisional site: http://www.ai.sri.com/sct/). 
Abstract. The recently proposed CONDENSATION algorithm and its variants 
enable the estimation of arbitrary multi-modal posterior distributions that 
potentially represent multiple tracked objects. However, the specific state 
representation adopted in the earlier work does not explicitly support 
counting, addition, deletion and occlusion of objects. Furthermore, the 
representation may increasingly bias the posterior density estimates towards 
objects with dominant likelihood as the estimation progresses over many 
frames. In this paper, a novel formulation and an associated 
CONDENSATION-like sampling algorithm that explicitly support counting, 
addition and deletion of objects are proposed. We represent all objects in an 
image as an object configuration. The a posteriori distribution of all possible 
configurations is explored and maintained using sampling techniques. The 
dynamics of configurations allow addition and deletion of objects and handle 
occlusion. An efficient hierarchical algorithm is also proposed to approximate 
the sampling process in high dimensional space. Promising comparative results 
on both synthetic and real data are demonstrated. 



1 Introduction 

Tracking multiple objects in videos is a key problem in many applications such as 
video surveillance, human computer interaction, and video conferencing. It is also a 
challenging research topic in computer vision. Some difficult issues involved are 
cluttered background, unknown number of objects, and complicated interaction 
between objects. Many tracking algorithms can be interpreted, explicitly or 
implicitly, in a probabilistic framework called the hidden Markov model (HMM) [1]. 

As shown in Fig. 1, the states of an object $x_t \in X$ at different time instances 
$t = 1, 2, \ldots, n$ form a Markov chain. State $x_t$ contains object deformation parameters 
such as positions and scale factors. At each time instance $t$, conditioned on $x_t$, 
observation $z_t$ is independent of other previous object states or observations. This 
model is summarized as 

$$P(x_1, x_2, \ldots, x_n; z_1, z_2, \ldots, z_n) = P(x_1) P(z_1 | x_1) \prod_{t=2}^{n} \left[ P(x_t | x_{t-1}) P(z_t | x_t) \right] \qquad (1)$$ 



B. Triggs, A. Zisserman, R. Szeliski (Eds.): Vision Algorithms’99, LNCS 1883, pp. 53-68, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 
The tracking problem can be posed as the computation of the a posteriori distribution 
$P(x_t | Z_t)$ for given observations $Z_t = \{z_1, z_2, \ldots, z_t\}$. When a single object is 
tracked, the maximum a posteriori (MAP) solution is desired. If both the object 
dynamics $P(x_t | x_{t-1})$ and the observation likelihood $P(z_t | x_t)$ are Gaussian 
distributions, $P(x_t | Z_t)$ is also a Gaussian distribution. The MAP solution is 
$E(x_t | Z_t)$. 

In order to compute $P(x_t | Z_t)$, a forward algorithm [1] is applied. It computes 
$P(x_t | Z_t)$ based on $P(x_{t-1} | Z_{t-1})$ inductively and is formulated as 

$$P(x_t | Z_t) \propto P(z_t | x_t) P(x_t | Z_{t-1}) = P(z_t | x_t) \int P(x_t | x_{t-1}) P(x_{t-1} | Z_{t-1})\, dx_{t-1} \qquad (2)$$ 
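
On a finite state space the integral in the forward recursion becomes a sum, which makes the update easy to write down. The sketch below is an illustration under that discretization assumption; the function and argument names are hypothetical.

```python
def forward_step(prior, transition, likelihood):
    """One step of the forward recursion on a finite state space.

    prior[i]         = P(x_{t-1} = i | Z_{t-1})
    transition[i][j] = P(x_t = j | x_{t-1} = i)
    likelihood[j]    = P(z_t | x_t = j)
    Returns the list P(x_t = j | Z_t) for all states j.
    """
    n = len(prior)
    # Prediction: P(x_t | Z_{t-1}) = sum_i P(x_t | x_{t-1} = i) P(x_{t-1} = i | Z_{t-1})
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: multiply by the likelihood and normalize.
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Calling `forward_step` once per frame, feeding each posterior back in as the next prior, reproduces the inductive computation described in the text.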

Using this formula, the well-known Kalman filter that computes $E(x_t | Z_t)$ for a 
Gaussian process can be derived [2]. When multiple objects are present, if the number of 
objects is fixed and the posterior of each object is Gaussian, a similar solution in 
analytic form is obtained. If the number of objects may change over time, a data 
association method such as multiple-hypothesis tracking (MHT) [3] has to be used. 
The complexity of the MHT algorithm is exponential with respect to time, and 
pruning techniques are necessary for real applications [4]. 




Fig. 1. The hidden Markov model. 



When the analytic form of either $P(x_t | x_{t-1})$ or $P(z_t | x_t)$ is not available, sampling 
techniques such as the CONDENSATION algorithm [5] are preferred. The idea is to 
represent $P(x_t | Z_t)$ with samples and to propagate the posterior distribution over 
time by computing the likelihood function $P(z_t | x_t)$ and simulating the dynamics 
$P(x_t | x_{t-1})$. In [6], a variance reduction method called importance sampling 
is used to reduce the number of samples and to handle data association 
problems. A more recent paper [8] deals with a fixed number of objects using a sampling 
scheme. 
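
A generic sample-propagation step of this kind can be sketched as follows. This is a simplified select / predict / measure cycle for a one-dimensional state, written for illustration only; the random-walk dynamics, the Gaussian likelihood centred on 3.0 and all names are assumptions of this example, not details of [5].

```python
import math
import random


def condensation_step(samples, weights, dynamics, likelihood, rng):
    """One select / predict / measure cycle on a weighted sample set."""
    n = len(samples)
    # Select: resample in proportion to the previous weights.
    selected = rng.choices(samples, weights=weights, k=n)
    # Predict: simulate the dynamics P(x_t | x_{t-1}) for each sample.
    predicted = [dynamics(x, rng) for x in selected]
    # Measure: weight each sample by the likelihood P(z_t | x_t).
    new_weights = [likelihood(x) for x in predicted]
    total = sum(new_weights)
    return predicted, [w / total for w in new_weights]


# Toy run: a static object observed near 3.0, tracked from a broad initial sample set.
rng = random.Random(0)
samples = [rng.uniform(-10.0, 10.0) for _ in range(500)]
weights = [1.0 / len(samples)] * len(samples)
for _ in range(5):
    samples, weights = condensation_step(
        samples, weights,
        dynamics=lambda x, r: x + r.gauss(0.0, 0.5),
        likelihood=lambda x: math.exp(-0.5 * (x - 3.0) ** 2),
        rng=rng,
    )
estimate = sum(w * x for w, x in zip(weights, samples))  # posterior mean estimate
```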

The original CONDENSATION algorithm and its variants use a single object state as 
the basic state representation. Presence of multiple objects is implicitly contained in 
the multiple peaks of the posterior distribution. When the CONDENSATION 
algorithm is applied to such a representation, it is very likely that a peak 
corresponding to the dominant likelihood value will increasingly dominate over all 
other peaks when the estimation progresses over time. In other words, a dominant 
peak is established if some objects obtain larger likelihood values more frequently. If 
the posterior is propagated with a fixed number of samples, eventually all samples will 
be around the dominant peak. A dominant peak may occur in many model-based 
tracking algorithms. For example, a head-shoulder contour deformable model may fit 
one person better than another in most frames of a video sequence. This phenomenon 
is further illustrated here with a synthetic example. 

Fig. 6 shows two frames of the synthetic sequence. More details of the sequence can 
be found in Section 6. In Fig. 9, the tracking results using the original 
CONDENSATION algorithm are illustrated. Since the likelihood function is biased 
to certain objects, the differences between these objects and the other objects in the 
posterior distribution increase exponentially with respect to the number of frames 
observed. In frame 15 (Fig. 9b), three peaks can be identified. In frame 25 (Fig. 9c), 
one object loses most of its samples because its likelihood is consistently 
smaller. In frame 65 (Fig. 9d), another object vanishes due to its smaller 
likelihood. This phenomenon can also be observed in Fig. 9e and Fig. 9f. 
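
The exponential growth of the imbalance can be reproduced with a small expected-value calculation. The sketch assumes two peaks, a constant likelihood ratio between them and resampling with a fixed sample budget; the ratio 0.9 and the function name are made up for illustration and do not come from the synthetic sequence.

```python
def expected_fraction(initial_fraction, likelihood_ratio, frames):
    """Expected share of samples on the weaker peak after each frame.

    likelihood_ratio is the likelihood of the weaker peak relative to the
    dominant one (a value below 1).
    """
    p = initial_fraction
    history = [p]
    for _ in range(frames):
        weak = p * likelihood_ratio  # weaker peak, reweighted by its likelihood
        strong = 1.0 - p             # dominant peak has relative likelihood 1
        p = weak / (weak + strong)   # renormalize: the sample budget is fixed
        history.append(p)
    return history


history = expected_fraction(0.5, 0.9, 65)
odds = history[-1] / (1.0 - history[-1])  # weak-to-strong sample ratio after 65 frames
```

The odds of a sample sitting on the weaker peak shrink by the likelihood ratio at every frame, so even a 10% disadvantage leaves the weaker object with roughly one sample in a thousand after 65 frames.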

Besides the dominant peak problem, the above example also illustrates that 
events such as addition, deletion, and occlusion cannot be handled naturally. In Fig. 9d, a 
new object appears but no samples are allocated to it. In Fig. 9h, an object 
disappears, but the samples are not redistributed to the other object. 

Importance sampling [6] is a data-driven mechanism that may alleviate some of the 
above problems. However, in order to maintain and update the count and state of 
multiple objects explicitly, a new representation is required. 

It should be noted that the limitation described here is not of the CONDENSATION 
process but of the state representation that is used by the tracker. In this paper we 
present a new representation and apply a CONDENSATION-like sampling algorithm 
for the estimation of the joint distribution of multiple objects in the presence of 
clutter, varying object counts and appearance/disappearance of objects. 



2 Tracking Multiple Objects 

Our goal is to (i) track multiple instances of an object template, (ii) maintain an 
expected value of the number of objects at any time instant, and (iii) be resilient to 
clutter, occlusion/deocclusion and appearance/disappearance of objects. In order to 
be able to represent multiple objects, we enhance the basic representation by 
representing all objects in the image as an object configuration (the term 
configuration is used in the rest of this paper for conciseness). A configuration is 
represented by a set of object deformation parameters $s_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,m}\} \in X^m$, 
where $m$ is the number of objects. If $K$ is the maximum possible number of objects 
in an image, the configuration space is $\bigcup_{m=0}^{K} X^m$. Given the enhanced representation, 
the goal is to compute the a posteriori probability of the configuration parameters 
$P(s_t | Z_t)$ instead of the a posteriori probability of object parameters $P(x_t | Z_t)$. 
The posterior for a configuration is given by 

$$P(s_t | Z_t) \propto P(z_t | s_t) P(s_t | Z_{t-1}) = P(z_t | s_t) \int P(s_t | s_{t-1}) P(s_{t-1} | Z_{t-1})\, ds_{t-1} \qquad (3)$$ 
To estimate this distribution, the configuration dynamics $P(s_t | s_{t-1})$ and the 
configuration likelihood $P(z_t | s_t)$ need to be modeled. Then a CONDENSATION-like 
sampling algorithm can be applied. Distribution $P(s_t | s_{t-1})$ describes the 
temporal behavior of a configuration in terms of how each of the individual objects 
changes, how a new object is introduced, how an existing object is deleted, and how 
to handle occlusion. The likelihood $P(z_t | s_t)$ measures how well the configuration 
fits the current observation. 



2.1 Dynamics of a Configuration - $P(s_t \mid s_{t-1})$

The configuration dynamics $P(s_t \mid s_{t-1})$ is decomposed into object-level and
configuration-level dynamics. Suppose $s_{t-1}$ contains $m$ objects, or
$s_{t-1} = \{x_{t-1,1}, x_{t-1,2}, \ldots, x_{t-1,m}\}$. Object-level
dynamics $P(x_{t,i} \mid x_{t-1,i})$ is first applied to predict the behavior of each
object. The resulting configuration is $\tilde{s}_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,m}\}$.
Then, the configuration-level dynamics $P(s_t \mid \tilde{s}_t)$ performs the object
deletion and addition.

2.1.1 Object-Level Dynamics - $P(x_{t,i} \mid x_{t-1,i})$

A commonly used model is:

$$x_{t,i} = A x_{t-1,i} + w \qquad (4)$$

where $w \sim N(0, \Sigma)$ is Gaussian noise and $A$ is the state transition matrix.
According to this model, $P(x_{t,i} \mid x_{t-1,i})$ has the Gaussian distribution
$N(A x_{t-1,i}, \Sigma)$.
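
As an illustration of Equation (4), the following Python sketch draws a predicted
state for one object. The function name `predict_object`, the diagonal noise model
(independent per-component standard deviations `sigma`) and the constant-velocity
example are assumptions made for illustration; the model above only requires some
transition matrix and a Gaussian noise term.

```python
import random

def predict_object(x_prev, A, sigma, rng=random):
    """Draw x_t ~ N(A x_{t-1}, diag(sigma^2)) for one object (Equation 4)."""
    # Deterministic part: matrix-vector product A x_{t-1}.
    mean = [sum(a * x for a, x in zip(row, x_prev)) for row in A]
    # Stochastic part: independent zero-mean Gaussian noise per component.
    return [m + rng.gauss(0.0, s) for m, s in zip(mean, sigma)]

# Hypothetical constant-velocity model with state (x, y, vx, vy).
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
x_next = predict_object([10.0, 20.0, 1.0, -2.0], A, sigma=[0.5, 0.5, 0.1, 0.1])
```

Setting every entry of `sigma` to zero reduces the sketch to the deterministic
prediction $A x_{t-1,i}$.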



2.1.2 Configuration-Level Dynamics - $P(s_t \mid \tilde{s}_t)$

The configuration-level dynamics should allow deletion and addition of objects in the
predicted configuration $\tilde{s}_t$.
Domain-dependent information should be brought in to model these events. For 
instance, knowledge about deletion and addition can be described as spatial birth and 
death processes [9]. 

Deletion probability $\beta(x, y)$ is defined as a function of the image coordinates
$(x, y)$. For example, $\beta(x, y)$ may have higher values around the scene boundaries
because objects usually disappear at those locations. For an object at $(x, y)$, its
chance of survival in the current frame is $1 - \beta(x, y)$. When occlusion happens in
an area with low deletion probability, the occluded object is unlikely to be deleted.

By the same token, addition probability is defined as a(x, y). Since new objects
always cause image changes, motion blobs are used to construct a(x, y). For video
with a static background, motion blobs are detected by an image differencing method.
a(x, y) is non-zero only in the regions of the motion blobs. For the case of a pan/tilt
or moving camera, the blob detection may be accomplished using background
alignment techniques and change detection algorithms [10].

Fig. 2. Configuration-level dynamics: (a) a video frame with static background (b)
deletion probability (c) motion blobs.



In Fig. 2a, a frame from a test video clip is shown. Fig. 2b shows the deletion
probability $\beta(x, y)$. The highest value in the image is around the border. The
corresponding motion blobs are shown in Fig. 2c. The addition probability a(x, y) is
0.01 in these blobs.
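
The deletion and addition events described in this section can be sketched in Python
as follows. This is only an illustration under stated assumptions: objects are reduced
to `(x, y)` positions, each motion blob proposes at most one new object per frame, and
the helper `border_beta` with its margin and the value 0.5 is a hypothetical deletion
map. None of these details are specified in the text.

```python
import random

def configuration_dynamics(objects, beta, blobs, a=0.01, max_objects=9, rng=random):
    """Apply deletion, then addition, to a predicted configuration.

    objects: list of (x, y) object positions
    beta:    function (x, y) -> deletion probability
    blobs:   list of motion blobs, each a list of (x, y) pixels
    a:       addition probability inside a blob (0.01 in Fig. 2)
    """
    # Each object survives with probability 1 - beta(x, y).
    survivors = [o for o in objects if rng.random() >= beta(o[0], o[1])]
    # Assumption: each blob proposes at most one new object per frame,
    # placed at a randomly chosen blob pixel.
    for blob in blobs:
        if len(survivors) < max_objects and blob and rng.random() < a:
            survivors.append(rng.choice(blob))
    return survivors

def border_beta(x, y, width=320, height=240, margin=20):
    """Hypothetical deletion map: 0.5 near the image border, 0 elsewhere."""
    near_border = min(x, y, width - 1 - x, height - 1 - y) < margin
    return 0.5 if near_border else 0.0
```

With a deletion map that is zero in the image interior, an object that is occluded
there is never removed by this step, which matches the behaviour described above.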



2.2 Likelihood of a Configuration 

$P(z_t \mid s_t)$ is a very complicated distribution. One possible type of approximation
is observation decomposition [7]. The image is spatially decomposed into small regions
and the likelihood is formulated as the product of local likelihoods. Since the
configuration is not decomposed, this leads to algorithms that operate in a
high-dimensional configuration space. In this paper, we propose an approximation using
configuration decomposition. The likelihood is replaced by an energy function and
decomposed into object-level and configuration-level terms. The energy function is
designed to give the more desirable configurations higher values. Intuitively, three
factors should be considered. The first factor is how well, on average, individual
objects in a configuration fit the observation. This is denoted the object-level
likelihood. For example, a contour matcher may be applied to calculate the
likelihood of each object in a configuration, and their geometric average is computed
as the object-level likelihood for that configuration. The average is taken to make it
independent of the number of objects. The second factor is how much of the
observation has been explained by the configuration. This is denoted the
configuration coverage. The third factor is the compactness. It is always desirable to
explain the observation using the minimum number of objects. All three factors are
indispensable. In Fig. 3, the likelihood of some configurations is illustrated.



2.2.1 Object-Level Likelihood 

For a given object, the likelihood $L(z_t, x_{t,i})$ measures how well the image data
supports the presence of this object. The likelihood can be defined as any reasonable
match measure, e.g. the normalized correlation between the object and the image, or
the Chamfer matching score for a contour representation of an object. For a
configuration with $m$ objects, the object-level likelihood is computed as the
geometric average of $L(z_t, x_{t,i})$. More precisely,

$$\lambda = \left( \prod_{i=1}^{m} L(z_t, x_{t,i}) \right)^{1/m} \qquad (5)$$




Fig. 3. Likelihood of a configuration: (a) highest (b) low: the region of interest is
not covered (c) low: too many objects are used to explain the data (d) low: the
likelihoods of individual objects are low.



2.2.2 Configuration Coverage 

In general, it is difficult to compute configuration coverage. However, for moving
object tracking, motion blobs are good cues. If we assume all the motion blobs in a
frame are caused by the objects to be tracked, the configuration coverage can be
computed as the percentage of the motion blob area covered by objects. It is
formulated as

$$\gamma = \frac{\left| A \cap \left( \bigcup_{i=1}^{m} B_i \right) \right|}{|A| + b} \qquad (6)$$

where $A$ is the union of motion blobs, $B_i$ is the area covered by object $i$ in a
configuration, and $b$ is a small positive constant used to avoid division by zero.
If $|A| = 0$, $\gamma = 1$.

2.2.3 Configuration Compactness 

The compactness is defined as the ratio between the data that has been explained and
the cost incurred. In terms of motion blobs, it can be computed as

$$\xi = \frac{\left| A \cap \left( \bigcup_{i=1}^{m} B_i \right) \right| + c}{\left| \bigcup_{i=1}^{m} B_i \right| + a} \qquad (7)$$

where $a$ is a small positive constant like $b$. If too many objects are used to explain
a small area, $\xi$ will be small. $c$ is a positive number chosen so that when
$|A| = 0$, the configurations with fewer objects have a higher score.

Finally, the overall likelihood of configuration $s_t$ is approximated by

$$P(z_t \mid s_t) \approx \lambda \cdot (\gamma \xi)^{\beta} \qquad (8)$$

where $\beta$ is a positive constant that controls the relative importance of the last
two terms. It should be mentioned that, depending on the application, different cues
may be used to compute the configuration coverage and compactness. For instance, color
blobs with skin colors can be applied for face tracking.
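
To make the interaction of the three factors concrete, the following Python sketch
evaluates them on pixel sets. It is a sketch under assumptions rather than the authors'
implementation: regions are represented as sets of `(x, y)` pixels, the three terms are
combined as $\lambda (\gamma \xi)^{\beta}$, an empty configuration is given
$\lambda = 1$, and the default constants `a`, `b`, `c` are arbitrary illustrative
values.

```python
import math

def configuration_likelihood(object_scores, object_regions, blob_pixels,
                             a=1.0, b=1.0, c=1.0, beta=1.5):
    """Combine object-level likelihood, coverage and compactness.

    object_scores:  list of L(z_t, x_{t,i}) values, each > 0
    object_regions: list of sets of (x, y) pixels, one set B_i per object
    blob_pixels:    set of (x, y) pixels, the union A of the motion blobs
    """
    m = len(object_scores)
    # Geometric mean of the object scores.
    # Assumption: lambda = 1 for a configuration with no objects.
    lam = math.exp(sum(math.log(s) for s in object_scores) / m) if m else 1.0
    covered = set().union(*object_regions)       # union of all B_i
    explained = len(blob_pixels & covered)       # |A intersect union(B_i)|
    # Coverage: fraction of blob area explained; defined as 1 when |A| = 0.
    gamma = explained / (len(blob_pixels) + b) if blob_pixels else 1.0
    # Compactness: explained area relative to the area spent on objects.
    xi = (explained + c) / (len(covered) + a)
    return lam * (gamma * xi) ** beta
```

For a 10x10 blob, a single well-placed object scores higher than the same object plus
a second one lying off the blob, because the extra object enlarges the covered area
without explaining more of the observation.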



3 A Sampling Algorithm 

Given the above formulation of configuration dynamics and likelihood, we now 
present a CONDENSATION-like algorithm to estimate the a posteriori configuration 
densities. Subsequently, we show how the standard CONDENSATION algorithm can 
be approximated using a fast hierarchical algorithm. 

Suppose $\pi_t^j = P(z_t \mid s_t^j)$, $j = 1, 2, \ldots, R_s$ is the likelihood of the
$j$th configuration $s_t^j$, where $R_s$ is the total number of configuration samples.
$R_s$ is a constant in the algorithm.

For $j$ from 1 to $R_s$, perform the following three steps.

Step 1. At time instance $t > 1$, randomly select the $j$th configuration sample
$s'^{\,j}_{t-1}$ from all $R_s$ samples $s_{t-1}^i$, $i = 1, 2, \ldots, R_s$ in the
previous frame according to their corresponding likelihood $\pi_{t-1}^i$,
$i = 1, 2, \ldots, R_s$.

Step 2. Apply the dynamics to predict the current configuration $s_t^j$ from
$s'^{\,j}_{t-1}$ using $P(s_t^j \mid s'^{\,j}_{t-1})$.

Step 3. Compute the new likelihood $\pi_t^j = P(z_t \mid s_t^j)$.

To initialize this process, $s_1^j$ is sampled randomly in the configuration space
$\bigcup_{m=0}^{K} X^m$. For example, if the maximum possible number of objects in a
configuration is $K = 9$ and 1000 configuration samples are initiated ($R_s = 1000$),
then for the 10 categories of configurations that contain 0 to 9 objects, 100 samples
are assigned to each category. For a configuration sample with $m$ objects, the
parameters of each object are randomly chosen in the parameter space. The
configuration likelihood is then computed. If the likelihood of a configuration is
high, according to Step 1, in the next iteration, this configuration is likely to be
selected. The expected number of objects in a frame can also be computed as
$\sum_{j=1}^{R_s} |s_t^j| \, \pi_t^j$, where $|s_t^j|$ is the number of objects in $s_t^j$.
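
The three steps, the initialization and the object count estimate can be sketched in
Python as follows. The function names and the representation of a configuration as a
list of objects are assumptions for illustration; `predict` and `likelihood` stand for
the configuration dynamics and configuration likelihood of Section 2 and are supplied
by the caller. The count estimate normalises the weights explicitly, which is also an
assumption.

```python
import random

def condensation_step(samples, weights, predict, likelihood, rng=random):
    """One iteration of Steps 1-3 over all configuration samples."""
    # Step 1: resample configurations in proportion to their previous likelihood.
    # (random.choices raises ValueError if all weights are zero.)
    selected = rng.choices(samples, weights=weights, k=len(samples))
    # Step 2: apply the configuration dynamics to each selected sample.
    predicted = [predict(s) for s in selected]
    # Step 3: evaluate the likelihood of each predicted configuration.
    new_weights = [likelihood(s) for s in predicted]
    return predicted, new_weights

def expected_count(samples, weights):
    """Likelihood-weighted mean number of objects (weights normalised here)."""
    total = sum(weights)
    return sum(len(s) * w for s, w in zip(samples, weights)) / total

def initial_samples(n_samples, max_objects, draw_object):
    """Spread samples evenly over configurations with 0..max_objects objects."""
    per_class = n_samples // (max_objects + 1)
    return [[draw_object() for _ in range(m)]
            for m in range(max_objects + 1) for _ in range(per_class)]
```

Calling `initial_samples(1000, 9, draw_object)` reproduces the allocation in the
example above: 100 samples for each of the 10 object counts.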

The above algorithm samples the a posteriori distribution of the configurations in a
high-dimensional space $X^m$. If there are $m$ objects in the scene, the posterior
has to be sampled in the space $X^m$. To maintain the same sample density, the number
of samples needs to grow exponentially with $m$, which makes the algorithm
impractical. Importance sampling techniques [6] alleviate the problem to some extent
by reducing the volume of the parameter space $X$; however, the dimensionality of
the sampling space is not reduced. A possible solution to this problem is to sample
from configurations with high likelihood. More specifically, in the first step,
$s'^{\,j}_{t-1}$ is only drawn from those $s_{t-1}^i$ with relatively large
$\pi_{t-1}^i$. This strategy makes the sampling process focus on the posterior
distribution around the MAP solution, which is desirable because the goal of the
tracking process is to obtain the MAP configuration. A problem with this method is
that the tracker is easily trapped in local maxima.



4 An Efficient Hierarchical Sampling Algorithm 

In this section, we describe an efficient hierarchical algorithm that decouples the
sampling process into two stages: a local configuration sampling stage and a global
configuration sampling stage. The local sampling stage tracks the motion of
individual objects, while the global configuration sampling stage handles object
addition and deletion. Strictly speaking, the algorithm does not propagate the
configuration posterior distribution. It reinforces the likelihood portion to some
extent so that the tracker is less likely to be trapped by locally optimal solutions.
To explain the algorithm more clearly, examples are provided for each step.

The first step is selecting new configuration samples based on the previous samples 
and their corresponding likelihood (see Section 3). For example, in Fig. 4a, four 
configurations are selected. They contain two, three, four, and four objects 
respectively. There are a total of thirteen objects in these four configurations.
Different shapes are used in the figure to distinguish objects in different
configurations.

The second step is local sampling of the object-level a posteriori distribution
conditioned on given configurations. More specifically, the image is first partitioned
into non-overlapping regions and configurations are broken into sub-configurations
according to the partition. For example, in Fig. 4b, the configuration marked by "!"
is decomposed into three sub-configurations in regions 2, 3, and 4. The
sub-configuration in region 4 contains two objects. In region 4, there are three other
sub-configurations containing 1, 1, and 2 objects respectively. After the partitioning,
in each region, object-level dynamics is applied to every object and the likelihood is
computed for each sub-configuration (Fig. 4c). Note that the configuration-level
dynamics such as object deletion and object addition are not performed in this step.
Next, in each image region, all sub-configurations with the same number of objects
are grouped together. According to their likelihood, they are sampled to produce the
same number of new sub-configurations. These samples are then assigned back to the
global configurations randomly (because there is no identity left after sampling). For
example, in region 4, based on the two resulting two-object sub-configurations in Fig.
4c and their corresponding likelihood, the sampling process is applied to obtain two
"new" sub-configurations (Fig. 4d). In this case, the two sub-configurations are
identical because the sub-configuration with higher likelihood has been selected
twice. The resulting sub-configurations are assigned arbitrarily back to the global
configurations.

In the third step, the configuration-level dynamics (see Section 2.1.2) are
applied and the likelihood is computed for each configuration (see Section 2.2). Fig. 4e
shows the result after the configuration-level dynamics have been applied. A new object
is added and a "#" object is deleted.

The hierarchical tracking algorithm is summarized as follows: 

Step 1. Select configurations: At time $t > 1$, select configuration samples. The
$j$th configuration sample $s'^{\,j}_{t-1}$ is selected randomly from all samples
$s_{t-1}^i$, $i = 1, 2, \ldots, R_s$ in the previous frame according to their
likelihood $\pi_{t-1}^i$, $i = 1, 2, \ldots, R_s$.

Step 2. Local object-level sampling: Partition the 2D image into regions and break
configurations into sub-configurations. In each region, apply object-level dynamics.
For sub-configurations containing the same number of objects, do sampling according
to their local configuration-level likelihood. Assign them randomly back to the
global configurations.

Step 3. Global configuration-level sampling: The configuration-level samples are
recovered. The likelihood $\pi_t^j = P(z_t \mid s_t^j)$ is computed. Go to the next frame.

Fig. 4. The hierarchical sampling algorithm for tracking multiple objects: (a) select
configurations (b) partition configurations into sub-configurations (c) local
object-level sampling (d) recover configurations from new sub-configurations (e) global
configuration-level dynamics and likelihood computation.
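
Step 2 is the least conventional part of the algorithm, so a minimal Python sketch of
it may help. All names here are illustrative assumptions: `region_of` maps an object to
a region of the image partition, `predict` applies the object-level dynamics, and
`local_likelihood` scores a sub-configuration within its region and must return a
positive weight. How sub-configurations are handed back to their owners is only
described as random in the text; the shuffle below is one way to do that.

```python
import random
from collections import defaultdict

def local_sampling(configs, region_of, predict, local_likelihood, rng=random):
    """Step 2: resample sub-configurations region by region.

    configs:          list of configurations, each a list of objects
    region_of:        function object -> region id (the image partition)
    predict:          object-level dynamics, object -> object
    local_likelihood: function (region id, list of objects) -> weight > 0
    """
    # groups[(region, count)] holds (configuration index, sub-configuration) pairs.
    groups = defaultdict(list)
    for idx, config in enumerate(configs):
        subs = defaultdict(list)
        for obj in config:
            # Partition first, then apply object-level dynamics inside the region.
            subs[region_of(obj)].append(predict(obj))
        for region, sub in subs.items():
            groups[(region, len(sub))].append((idx, sub))

    rebuilt = [[] for _ in configs]
    for (region, _), members in groups.items():
        subs = [sub for _, sub in members]
        weights = [local_likelihood(region, sub) for sub in subs]
        # Draw as many sub-configurations as the group contained.
        drawn = rng.choices(subs, weights=weights, k=len(subs))
        # Sub-configurations carry no identity, so hand them back in random order.
        owners = [idx for idx, _ in members]
        rng.shuffle(owners)
        for idx, sub in zip(owners, drawn):
            rebuilt[idx].extend(sub)
    return rebuilt
```

Because grouping is by region and object count, every global configuration keeps the
same number of objects in every region after this step; object counts change only in
the configuration-level dynamics of Step 3.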



5 Implementation 

The proposed hierarchical algorithm has been implemented on a Pentium II 400 MHz 
PC. It runs at 1 frame/s when 300 configuration samples are used on 320x240 video 
frames. 



5.1 Video Preprocessing 

Detecting motion blobs is an important step for computing the configuration-level
likelihood. Several methods are available, such as background subtraction and
two-image or three-image differencing algorithms. A three-image differencing method
is used in our implementation [10].
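
The details of the method in [10] are not given here. As a rough illustration only,
one common form of three-image differencing marks a pixel as moving when the current
frame differs from both the previous and the next frame. The function name, the list
of rows representation and the threshold value below are assumptions.

```python
def three_frame_difference(prev, curr, nxt, threshold=20):
    """Binary motion mask: a pixel moves if it differs from both neighbours.

    prev, curr, nxt: greyscale frames as equally sized lists of rows of ints.
    """
    return [[abs(c - p) > threshold and abs(c - n) > threshold
             for p, c, n in zip(row_p, row_c, row_n)]
            for row_p, row_c, row_n in zip(prev, curr, nxt)]
```

Requiring a difference against both neighbouring frames suppresses the trailing
"ghost" that plain two-image differencing leaves at an object's previous position.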



5.2 Object Representation 

As shown in Fig. 5, a contour-plus-region representation is designed. To track
multiple people, the head-shoulder contour in Fig. 5a is compared with the edge
images in order to obtain the object likelihood $L(z_t, x_{t,i})$. The contour template
is divided into several line segments. $L(z_t, x_{t,i})$ is computed as the weighted
average of the matching scores for the individual template contour segments. The
regions of the template are represented by rectangles and are used to compute
$\gamma$ and $\xi$ using Equations (6) and (7). The parameter $\beta$, which controls
the relative importance of the object-level likelihood and the configuration-level
likelihood, equals 1.5.



5.3 The Hierarchical Algorithm 

A fixed number of configuration samples is used in the algorithm. These samples
are evenly distributed over configurations with different numbers of objects at the
initialization stage. For the first frame, several iterations of the algorithm are
executed to obtain the initial prior (with different dynamics). The size of each local
image region in our implementation is 10x10 pixels.

Fig. 5. (a) A simple contour-region representation of people, (b) a coarse 2D
contour-region representation of spherical objects.



6 Experimental Results 

6.1 The Synthetic Sequence 

Both a synthetic image sequence and natural video data are tested. The synthetic
sequence contains four moving objects of similar shapes (Fig. 6). They are polygons
approximating a circle, with six, seven, eight, and thirty sides. These objects undergo
only translations in this test sequence. A translation-invariant object-level
likelihood function is computed based on a generic contour model and a contour
matching algorithm. The likelihood values of these four objects remain consistent over
time and differ slightly because of their different shapes. This setup resembles many
model-based trackers in that a generic model (built either by learning or by
design) is used to track an entire class of objects. These objects enter and leave
the scene at the image boundaries. There is one instance of object occlusion in this
sequence.

The background image is formed by Gaussian noise. To simulate some random
irrelevant moving objects, white noise is added to the background at two locations,
which gives rise to some spurious motion blobs. Finally, noise is added to the
appearance of the moving objects. Quantitative analysis is conducted based on the
tracking results and the actual number and positions of objects in each frame.

For the synthetic sequence, we compared the results of the CONDENSATION 
algorithm and the hierarchical algorithm. In the CONDENSATION algorithm, 600 
object samples are used. In the latter one, 300 configuration samples are initialized. 
These 300 samples are evenly distributed in terms of number of objects in a 
configuration and object parameter values. 

In Fig. 9, the tracking results of the original CONDENSATION algorithm with a
single object state representation are shown. Importance sampling is used in the first
frame to obtain a better prior. Object samples are represented by white dots. In Fig.
9i, marginal sample distributions on the vertical image axis for every five frames are
shown. As explained in Section 1, dominant peaks and inappropriate handling of
object addition and deletion are observed.

In Fig. 10, the corresponding results of the hierarchical tracking algorithm with the
multiple object representation are demonstrated. Four distinct trajectories are
observed in Fig. 10i. Events such as addition, deletion, and occlusion can be easily
distinguished.




Fig. 6. (a)(b) Two frames in the synthetic sequence (c) the edge map and the tracking
result (white dots are object samples) (d) motion blobs.



As mentioned in Section 3, by applying the new representation, the expected number of
objects in each frame is computed from configuration samples. In Fig. 7, the
expected number of objects in each frame using the hierarchical algorithm is shown.
In the same figure, the actual number of objects is also drawn. (The first 20 iterations
are used for algorithm initialization and are not significant in the comparison.) The
number of objects in most frames is correctly estimated, even during the occlusion
period.

Fig. 7. Object counts in the synthetic sequence.



6.2 Tracking Multiple People 

Both algorithms have been tested on real video sequences. For tracking multiple
people, a simple contour-plus-region template is designed (Fig. 5a). Only translation
is modeled in the transformation. A frame is shown in Fig. 2a. Its corresponding
motion blobs are shown in Fig. 2c. For this particular sequence, deletion is only
allowed in the gray regions drawn in Fig. 2b. Fig. 8 demonstrates the tracking results
in some frames. Four persons are simultaneously tracked. The number of persons in
the scene is automatically estimated in the hierarchical algorithm.





Fig. 8. The results of tracking multiple people. 



7 Conclusions 

The new representation proposed in this paper explicitly models multiple objects in a
video frame as an object configuration. Events such as object addition, deletion,
and occlusion are modeled in configuration-level dynamics. With this formulation,
CONDENSATION-like tracking algorithms can be designed to propagate the
configuration posterior. A hierarchical sampling algorithm is also proposed in this
paper. Promising comparative experimental results of the CONDENSATION
algorithm and the new algorithm on both synthetic and real data are demonstrated.
Compared to the multiple-hypothesis tracking method, which is an approximation of
the Viterbi algorithm based on local maxima of the likelihood function, the proposed
algorithm explores the likelihood function in the whole parameter space. However,
the concept of configuration tracks needs to be introduced to fully model the data
association over time.

Fig. 9. Results of the CONDENSATION algorithm in frame (a) 1 (b) 10 (c) 25 (d)
65 (e) 85 (f) 90 (g) 110 (h) 140 and (i) the marginal sample distribution along the
vertical image axis. The left side is the top of the images.

Discussion

Tom Drummond: It's clear that insertion is harder than deletion because you have to 
generate a whole new set of parameters to describe the new object. Judging by your 
results, you allocated maybe 10-15% of the samples to insertion. Did you find that you 
needed to artificially amplify the probability of a new object to provide dense enough 
samples for insertion? 

Harpreet Sawhney: We selected the addition and deletion probabilities empirically, 
but there wasn't a lot of tuning as there are very few free parameters. 

Bill Triggs: You suggested combining multiple hypothesis tracking and
Condensation. With classical MH tracking there's a combinatorial search over the
joint hypotheses, to enforce the exclusion principle / unicity. How would you put that
together with CONDENSATION, what are the trade-offs and what sort of computational
complexities would you expect?

Harpreet Sawhney: I haven't done any real experiments to explore this, but the trade- 
offs you mention are certainly the questions to ask for multiple hypothesis tracking. 
Something like CONDENSATION, but with more explicit data associations, might allow 
us to maintain the identity of object configurations over time without doing all the 
combinatorics that multiple hypotheses imply. But this is just a vague idea that we are 
beginning to explore. 

Olivier Faugeras: I think that the Bayesian formalisms that you and others have been 
using recently are just not the right way to do tracking, because you are really pushing 
reality into being Bayesian and using Bayes Rule. I don't think these probabilities are 
really attainable in practice. Have you considered an alternative approach based on 
variational calculus, using the very efficient and well developed and practically useful 
theory of partial differential equations? 

Harpreet Sawhney: No, I haven't thought about that approach. But I don't really 
agree that the probabilistic models are not the right ones. For one thing, in tracking the 
specific object models can be very complex and difficult to capture in a complete state 
description. You could have an ideal model, but real objects will vary from that both 
statistically and systematically. Secondly, when you need to discover new objects in 
the data, it seems to me that probabilistic representations offer an efficient method of 
capturing what is going on. I don't see how you would use a variational representation 
for doing something like tracking, but I am happy to talk to you about that.

Abstract. This paper presents a visual servoing system which incorporates
a novel three-dimensional model-based tracking system. This
tracking system extends constrained active contour tracking techniques 
into three dimensions, placing them within a Lie algebraic framework. 
This is combined with modern graphical rendering technology to create 
a system which can track complex three dimensional structures in real 
time at video frame rate (25 Hz) on a standard workstation without spe- 
cial hardware. The system is based on an internal CAD model of the 
object to be tracked which is rendered using binary space partition trees 
to perform hidden line removal. The visible features are identified on-line 
at each frame and are tracked in the video feed. Analytical and statistical 
edge saliency are then used as a means of increasing the robustness of 
the tracking system. 



1 Introduction 

The tracking of complex three-dimensional objects is useful for numerous appli- 
cations, including motion analysis, surveillance and robotic control tasks. 

This paper tackles two problems. Firstly, the accurate tracking of a known 
three-dimensional object in the field of view of a camera with known internal 
parameters. The output of this tracker is a continuously updated estimate of the 
pose of the object being viewed. Secondly, the use of this information to close 
the loop in a robot control system to guide a robotic arm to a previously taught 
target location relative to a workpiece. This work is motivated by problems such 
as compensation for placement errors in robotic manufacturing. The example 
presented in this paper concerns the welding of ship parts. 

1.1 Model-Based Tracking 

Because a video feed contains a very large amount of data, it is important to
extract only a small amount of salient information if real-time frame (or field)
rate performance is to be achieved. This observation leads to the notion
of feature-based tracking [2], in which processing is restricted to locating strong
image features such as contours.



B. Triggs, A. Zisserman, R. Szeliski (Eds.): Vision Algorithms’99, LNCS 1883, pp. 69-|^^ 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 






T. Drummond and R. Cipolla 



A number of successful systems have been based on tracking the image contours
of a known model. Lowe used the Marr-Hildreth edge detector to extract
edges from the image which were then chained together to form lines. These
lines were matched and fitted to those in the model. A similar approach using
the Hough transform has also been used. The use of two-dimensional image
processing incurs a significant computational cost and both of these systems
make use of special purpose hardware in order to achieve frame rate processing.

An alternative approach is to render the model first and then use sparse
one-dimensional search to find and measure the distance to matching (nearby) edges
in the image. This approach has been used in RAPID, Condensation
and other systems. The efficiency yielded by this approach allows all
these systems to run in real-time on standard workstations. The approach is also
used here and discussed in more detail below.

Using either of these approaches, most systems (except Condensation) then 
compute the pose parameters by linearising with respect to image motion. This 
process is reformulated here in terms of the Lie group SE(3) and its Lie algebra. 
This formulation is a natural one to use since the group SE(3) exactly represents 
the space of poses that form the output of the system. The Lie algebra of a group 
is the tangent space to the group at the identity and is therefore the natural 
space in which to represent differential quantities such as velocities and small 
motions in the group. Thus the representation provides a canonical method for 
linearising the relationship between image motion and pose parameters. Further, 
this approach can be generalised to other transformation groups and has been 
successfully applied to deformations of a planar contour using the groups GA(2) 
and P(2).

Outliers are a key problem that must be addressed by systems which measure 
and fit edges. They frequently occur in the measurement process since additional 
edges may be present in the scene in close proximity to the model edges. These 
may be caused by shadows, for example, or strong background scene elements. 
Such outliers are a particular problem for the traditional least-squares fitting 
method used by many of the algorithms. Methods of improving robustness to
these sorts of outliers include the use of RANSAC, factored sampling, or
regularisation, for example the Levenberg-Marquardt scheme. The
approach used here employs iteratively re-weighted least squares (a robust
M-estimator) which is then extended to incorporate a number of additional saliency
measures. This is discussed in more detail below.

There is a trade-off to be made between robustness and precision. The
Condensation system, for example, obtains a high degree of robustness by taking a
large number of sample hypotheses of the position of the tracked structure with
a comparatively small number of edge measurements per sample. By contrast,
the system presented here uses a large number of measurements for a single
position hypothesis and is thus able to obtain very high precision in its positional
estimates. This is particularly relevant in tasks such as visual servoing since the
dynamics and environmental conditions can be controlled so as to constrain the
robustness problems, while high precision is needed in real-time in order for the
system to be useful.

Occlusion is also a significant cause of instabilities and may occur when the 
object occludes parts of itself (self occlusion) or where another object lies between 
the camera and the target (external occlusion) . RAPID handles the first of these 
problems by use of a pre-computed table of visible features indexed by what is 
essentially a view-sphere. By contrast, the system presented here uses graphical 
rendering techniques to dynamically determine the visible features and is thus 
able to handle more complex situations (such as objects with holes) than can be 
tabulated on a view-sphere. 

External occlusion can be treated by using outlier rejection, for example by
discarding primitives for which insufficient support is found, or by modifying
statistical descriptions of the observation model. If a model is available
for the intervening object, then it is possible to use this to re-estimate the visible
features. Both of these methods are used within the system presented here.

1.2 Visual Servoing 

The use of visual feedback output by such tracking systems for robotic control is
becoming an increasingly attractive proposition. A distinction is often made
between image-based and position-based visual servoing. The approach
presented here is position-based but closes the control loop by projecting the
action of three-dimensional camera motion into the image where it is fitted to
image measurements. Since the eye-in-hand approach is used, this generates a
motion-to-image Jacobian (also known as the interaction screw) which can
be used to generate robot control commands to minimise the image error.

2 Theoretical Framework 

The approach proposed here for tracking a known 3-dimensional structure is
based upon maintaining an estimate of the camera projection matrix, $P$, in the
co-ordinate system of the structure. This projection matrix is represented as the
product of a matrix of internal camera parameters:

$$K = \begin{bmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

and a Euclidean projection matrix representing the position and orientation of
the camera relative to the target structure:

$$E = [\, R \;\; t \,] \quad \text{with } RR^T = I \text{ and } |R| = 1 \qquad (2)$$

The projective co-ordinates of an image feature are then given by

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = P \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (3)$$

with the actual image co-ordinates given by

$$\begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix} = \begin{pmatrix} u/w \\ v/w \end{pmatrix} \qquad (4)$$

Rigid motions of the camera relative to the target structure between con- 
secutive video frames can then be represented by right multiplication of the 
projection matrix by a Euclidean transformation of the form: 



These M, form a 4 x 4 matrix representation of the group SE(3) of rigid 
body motions in 3-dimensional space, which is a 6-dimensional Lie Group. The 
generators of this group are typically taken to be translations in the x, y and z 
directions and rotations about the x, y and z axes, represented by the following 
matrices: 
These generators form a basis for the vector space (the Lie algebra) of derivatives 
of SE(3) at the identity. Consequently, the partial derivative of projective 
image co-ordinates under the ith generating motion can be computed as: 

$$\begin{pmatrix} u'_i \\ v'_i \\ w'_i \end{pmatrix} = P G_i \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \qquad (7)$$

with 

$$L_i = \begin{pmatrix} \tilde{u}'_i \\ \tilde{v}'_i \end{pmatrix} = \begin{pmatrix} \dfrac{u'_i}{w} - \dfrac{u\,w'_i}{w^2} \\[2ex] \dfrac{v'_i}{w} - \dfrac{v\,w'_i}{w^2} \end{pmatrix} \qquad (8)$$

giving the motion in true image co-ordinates. A least squares approach can then 
be used to fit the observed motion of image features between adjacent frames. 
This process is detailed in Section 3.3. 

This method can be extended to include the motion of image features due to 
the change in internal camera parameters or internal model parameters by 
incorporating the vector fields they generate into the least squares process. 
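The motion fields of Eqs. (7) and (8) can be sketched numerically as follows. This is an illustrative sketch, not the authors' implementation: the projection matrix, the model point, the helper names and the right-handed sign convention of the rotation generators are all assumptions.

```python
# Sketch of Eqs. (7) and (8): image motion L_i induced by each generator G_i.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def generators():
    """Translations along x, y, z, then rotations about x, y, z (assumed signs)."""
    gens = []
    for axis in range(3):
        g = [[0.0] * 4 for _ in range(4)]
        g[axis][3] = 1.0
        gens.append(g)
    for a, b in ((1, 2), (2, 0), (0, 1)):
        g = [[0.0] * 4 for _ in range(4)]
        g[b][a], g[a][b] = 1.0, -1.0
        gens.append(g)
    return gens

def project(P, X):
    """Image co-ordinates (u/w, v/w) of the homogeneous point X."""
    u, v, w = matvec(P, X)
    return (u / w, v / w)

def motion_field(P, G, point):
    """Derivative of the image co-ordinates of `point` under generator G."""
    X = [point[0], point[1], point[2], 1.0]
    u, v, w = matvec(P, X)
    du, dv, dw = matvec(P, matvec(G, X))        # Eq. (7)
    return (du / w - u * dw / w ** 2,           # Eq. (8), quotient rule
            dv / w - v * dw / w ** 2)

# Hypothetical projection matrix and model point.
P = [[800.0, 0.0, 320.0, 640.0],
     [0.0, 800.0, 240.0, 480.0],
     [0.0, 0.0, 1.0, 2.0]]
point = (0.1, -0.2, 0.3)
fields = [motion_field(P, G, point) for G in generators()]
```

Each entry of `fields` is the two-dimensional image velocity of the point for one mode of Euclidean motion; a finite-difference perturbation of the point by a small multiple of G_i gives the same vector to first order.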



2.1 Tracking Edges 

An important aspect of the approach presented here is the decision to track 
the edges of the model (which appear as intensity discontinuities in the video 
feed). Edges are strong features that can be reliably found in the image because 
they have a significant spatial extent. Furthermore, this means that a number of 
measurements can be made along each edge, and thus they may be accurately 
localised within an image. 

Fig. 1. Computing the normal component of the motion 

This approach also takes advantage of the aperture problem (that the component 
of motion of an edge, tangent to itself, is not observable locally). This 
problem actually yields a substantial benefit since the search for intensity 
discontinuities in the video image can be limited to a one dimensional path that 
lies along the edge normal, $\hat{n}$ (see Figure 1), and thus has linear complexity in 
the search range, rather than quadratic as for a two-dimensional feature search. 
This benefit is what makes it possible to track complex structures in real time 
on a standard workstation without additional hardware. The normal components 
of the motion fields, $L_i$, are then also computed (as $L_i \cdot \hat{n}$). 
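The one-dimensional search can be sketched as below. The image representation (a list of rows of grey levels), nearest-pixel sampling, the gradient threshold and the function name are assumptions made for illustration; the point is only that the cost grows linearly with the search range.

```python
# Sketch: search along the edge normal for the nearest intensity discontinuity.

def search_along_normal(image, start, normal, search_range, threshold):
    """Return the signed distance (pixels) along `normal` from `start` to the
    nearest intensity step of at least `threshold`, or None if none is found."""
    def sample(step):
        col = int(round(start[0] + step * normal[0]))
        row = int(round(start[1] + step * normal[1]))
        if 0 <= row < len(image) and 0 <= col < len(image[0]):
            return image[row][col]
        return None

    # Visit unit intervals in order of increasing distance from the start point.
    for dist in range(search_range):
        for step in (dist, -dist - 1):
            a, b = sample(step), sample(step + 1)
            if a is not None and b is not None and abs(b - a) >= threshold:
                return step + 0.5
    return None

# Synthetic image: dark for columns below 12, bright from column 12 onwards.
image = [[0 if col < 12 else 255 for col in range(20)] for _ in range(20)]
distance = search_along_normal(image, (8, 10), (1.0, 0.0), 8, threshold=50)
flat = [[7] * 20 for _ in range(20)]
no_edge = search_along_normal(flat, (8, 10), (1.0, 0.0), 8, threshold=50)
```

Starting at column 8 with a horizontal normal, the step between columns 11 and 12 is reported at a distance of 3.5 pixels; a uniform image yields no measurement.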

3 Tracking System 

The three-dimensional tracking system makes use of constrained snake technology 
to follow the edges of the workpiece that are visible in the video image. 
One novel aspect of this work is the use of a real-time hidden-line-removal 
rendering system (using binary space partition trees) to dynamically determine 
the visible features of the model in real-time. This technique allows accurate 
frame rate tracking of complex structures such as the ship part shown in Figure 2. 

Fig. 2. Image and CAD model of ship part 

Figure 3 shows system operation. At each cycle, the system renders the expected 
view of the object (a) using its current estimate of the projection matrix, 
P. The visible edges are identified and tracking nodes are assigned at regular 
intervals in image co-ordinates along these edges (b). The edge normal is then 
searched in the video feed for a nearby edge (c). Typically m ≈ 400 nodes are 
assigned and measurements made in this way. The system then projects this 
m-dimensional measurement vector onto the 6-dimensional subspace corresponding 
to Euclidean transformations (d) using the least squares approach described in 
Section 3.3 to give the motion, M. The Euclidean part of the projection matrix, 
E, is then updated by right multiplication with this transformation (e). Finally, 
the new projection matrix P is obtained by multiplying the camera parameters 
K with the updated Euclidean matrix to give a new current estimate of the local 
position (f). The system then loops back to step (a). 

3.1 Rendering the Model 

Fig. 3. Tracking system operation 

Fig. 4. Tracking nodes assigned and distances measured 

In order to accurately render a CAD model of a complex structure such as the 
one shown in Figure 2 at frame rate, an advanced rendering technique such as 
the use of binary space partition trees is needed. This approach represents 
the object as a tree, in which each node contains the equation of a plane in the 
model, together with a list of edges and convex polygons in that plane. Each 
plane partitions 3-dimensional space into the plane and the two open regions 
either side of the plane. The two branches of the tree represent those parts of 
the model that fall into these two volumes. Thus the tree recursively partitions 
space into small regions which, in the limit, contain no remaining model features. 
The rendering takes place by performing an in-order scan of the tree, where at 
each node, the viewpoint is tested to see if it lies in front of or behind the plane. 
When this is determined, those features lying closer to the camera are rendered 
first, then the plane itself, and finally, the more distant features. The use of a 
stencil buffer prevents over-writing of nearer features by more distant ones and 
also provides a layer map when the rendering is complete. The ship part contains 
12 planes, but since 8 of these (corresponding to the T and L beams) are split 
into two parts by a vertical plane partition, there are 20 nodes in the tree. 
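The in-order scan described above can be sketched as a short recursive traversal. The class and function names are hypothetical, the stencil buffer is not modelled, and each node carries only a name in place of its edges and polygons.

```python
# Sketch: front-to-back traversal of a binary space partition tree.

class BSPNode:
    """One model plane a*x + b*y + c*z + d = 0 with its two subtrees."""
    def __init__(self, name, plane, front=None, back=None):
        self.name = name
        self.plane = plane      # (a, b, c, d)
        self.front = front      # subtree on the positive side of the plane
        self.back = back        # subtree on the negative side of the plane

def side(plane, viewpoint):
    """Signed value of the plane equation at the viewpoint."""
    a, b, c, d = plane
    x, y, z = viewpoint
    return a * x + b * y + c * z + d

def front_to_back(node, viewpoint, visit):
    """Visit the subtree nearer the viewpoint, then the plane, then the rest."""
    if node is None:
        return
    if side(node.plane, viewpoint) >= 0:
        near, far = node.front, node.back
    else:
        near, far = node.back, node.front
    front_to_back(near, viewpoint, visit)
    visit(node)
    front_to_back(far, viewpoint, visit)

# Three parallel planes x = -1, x = 0 and x = +1 (illustrative model).
root = BSPNode("x=0", (1.0, 0.0, 0.0, 0.0),
               front=BSPNode("x=+1", (1.0, 0.0, 0.0, -1.0)),
               back=BSPNode("x=-1", (1.0, 0.0, 0.0, 1.0)))
order_from_right = []
front_to_back(root, (5.0, 0.0, 0.0), lambda n: order_from_right.append(n.name))
order_from_left = []
front_to_back(root, (-5.0, 0.0, 0.0), lambda n: order_from_left.append(n.name))
```

Viewed from x = 5 the planes are visited nearest first (x=+1, x=0, x=-1), and the order reverses when the viewpoint moves to x = -5. In a renderer, `visit` would draw the node's features and write its layer to the stencil buffer.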

3.2 Locating Edges 

Once rendering is complete, the layer map is used to locate the visible parts 
of each edge by comparing the assigned layer of the plane for each edge in the 
model with the layer in the stencil buffer at a series of points along that edge. 
Where the depths agree, trackers are assigned to search for the nearest edge in 
the video feed along the edge normal (see Figure 4). 

The result of this process is a set of trackers with known position in the model 
co-ordinate system, with computed edge normals and the distance along those 
normals to the nearest image edge. Grouping these distances together provides 
an m-dimensional measurement vector. 
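A minimal sketch of the visibility test follows, assuming the layer map is available as a list of rows of integer layer identifiers and that each edge is already projected into image co-ordinates. The function name and the sampling scheme are illustrative assumptions.

```python
# Sketch: keep only the sample points of an edge that the layer map shows as visible.

def visible_samples(edge_start, edge_end, edge_layer, layer_map, spacing):
    """Sample an image-space edge every `spacing` pixels and keep the points
    where the layer map holds the layer of the edge's own plane."""
    (x0, y0), (x1, y1) = edge_start, edge_end
    length = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    count = int(length // spacing)
    nodes = []
    for k in range(count + 1):
        s = k * spacing / length if length > 0 else 0.0
        x, y = x0 + s * (x1 - x0), y0 + s * (y1 - y0)
        if layer_map[int(round(y))][int(round(x))] == edge_layer:
            nodes.append((x, y))
    return nodes

# Layer 1 is hidden behind a nearer layer 2 from column 6 onwards.
layer_map = [[1 if col < 6 else 2 for col in range(10)] for _ in range(10)]
nodes = visible_samples((0, 5), (9, 5), 1, layer_map, spacing=3)
```

Of the four samples taken along the edge, only the two lying in columns 0 and 3 survive; the samples at columns 6 and 9 are occluded and receive no tracker.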



3.3 Computing the Motion 

Fig. 5. Frames from tracking sequence 

Step (d) in the process involves the projection of the measurement vector onto 
the subspace defined by the Euclidean transformation group. The action of each 
of the generators of SE(3) on the tracking nodes in image co-ordinates can be 
found by computing $PG_i$ and applying this to the homogeneous co-ordinates 
of the node in 3-space. This can be projected to give a vector, $L_i^\xi$, describing 
the image motion of the $\xi$th node for the $i$th generator of Euclidean motion 
of the object. $L_i^\xi \cdot \hat{n}^\xi$ then describes the magnitude of the edge normal motion 
that would be observed in the image at each node for each group generator. 
These can be considered as a set of m-dimensional vectors which describe the 
motion in the image for each mode of Euclidean transformation. The system then 
projects the m-vector corresponding to the measured distances to the observed 
edges onto the subspace spanned by the transformation vectors. This provides a 
solution to finding the geometric transformation of the part which best fits the 
observed edge positions, minimising the square error between the transformed 
edge position and the actual edge position (in pixels). This process is performed 
as follows: 

$$O_i = \sum_\xi d^\xi \,(L_i^\xi \cdot \hat{n}^\xi) \qquad (9)$$

$$C_{ij} = \sum_\xi (L_i^\xi \cdot \hat{n}^\xi)(L_j^\xi \cdot \hat{n}^\xi) \qquad (10)$$

$$\alpha_i = C^{-1}_{ij} O_j \qquad (11)$$

(with Einstein summation convention over Latin indices). The $\alpha_i$ are then the 
coefficients of the vector in the Lie algebra representing the quantity of each 
mode of Euclidean motion that has been observed. The final step is to compute 
the actual motion of the model and apply it to the matrix E in (2). This is done 
by using the exponential map relating to SE(3). 

$$E_{t+1} = E_t \exp\Big(\sum_i \alpha_i G_i\Big) \qquad (12)$$

This section has described the basic version of the tracking system. This 
system performs well over a wide range of configurations (see Figure 5). However, 
in order to improve robustness to occlusion and critical configurations which 
cause instability, saliency characteristics are introduced. 



4 Extended Iterative Re-weighted Least Squares 

The naive least squares algorithm presented in Section 3.3 is vulnerable to 
instabilities caused by the presence of outliers. This is because the sum-of-squares 
objective function can be significantly affected by a few measurements with 
large errors. Equivalently, the corresponding Gaussian distribution dies off far 
too quickly to admit many sample measurements at a large number of standard 
deviations. 

A standard technique for handling this problem is the substitution of a robust 
M-estimator for the least squares estimator by replacing the objective function 
with one that applies less weighting to outlying measurements. This can 
be achieved by modifying the least squares algorithm and replacing (9) and (10) 
with: 

$$O_i = \sum_\xi s(d^\xi)\, d^\xi \,(L_i^\xi \cdot \hat{n}^\xi) \qquad (13)$$

$$C_{ij} = \sum_\xi s(d^\xi)\,(L_i^\xi \cdot \hat{n}^\xi)(L_j^\xi \cdot \hat{n}^\xi) \qquad (14)$$

A common choice for the weighting function, s, is: 

$$s(d^\xi) = \frac{1}{c + |d^\xi|} \qquad (15)$$

which corresponds to replacing the Gaussian error distribution with one that lies 
between Gaussian and Laplacian. The parameter c is chosen here to be approximately 
one standard deviation of the inlying data. This approach is known as 
iterative re-weighted least squares (IRLS) since s depends on d, which changes 
with each iteration. In the current implementation, only a single iteration is 
performed for each frame of the video sequence and convergence occurs rapidly over 
sequential frames. Incorporating IRLS into the system improves its robustness 
to occlusion (see Figure 6). The function s controls the confidence with which 
each measurement is fitted in the least squares procedure and thus can be viewed 
as representing the saliency of the measurement. 
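The effect of the re-weighting can be seen on a toy problem. The sketch below assumes the weighting function s(d) = 1/(c + |d|), which has the stated behaviour between Gaussian and Laplacian, and uses a hypothetical one-parameter fit with a single gross outlier; the function names and data are illustrative only.

```python
# Sketch of Eqs. (13)-(15): one re-weighted least squares step.

def weight(d, c):
    """Assumed weighting function: s(d) = 1 / (c + |d|)."""
    return 1.0 / (c + abs(d))

def weighted_normal_equations(f, d, c):
    """Eqs. (13) and (14): each node's contribution is scaled by s(d)."""
    k = len(f[0])
    s = [weight(di, c) for di in d]
    O = [sum(si * di * row[i] for si, di, row in zip(s, d, f)) for i in range(k)]
    C = [[sum(si * row[i] * row[j] for si, row in zip(s, f)) for j in range(k)]
         for i in range(k)]
    return O, C

# Nine inliers at d = 1 pixel and one outlier at d = 50 pixels,
# all with unit normal motion for a single mode of motion.
f = [[1.0] for _ in range(10)]
d = [1.0] * 9 + [50.0]

plain_O = sum(di * row[0] for di, row in zip(d, f))
plain_C = sum(row[0] ** 2 for row in f)
alpha_plain = plain_O / plain_C            # ordinary least squares

O, C = weighted_normal_equations(f, d, c=1.0)
alpha_robust = O[0] / C[0][0]              # single re-weighted step
```

Ordinary least squares is dragged to 5.9 by the outlier, while the single re-weighted step gives about 1.21, close to the inlier value of 1.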

Fig. 6. Frames from tracking sequence with occlusion 

This can be further exploited by extending IRLS to incorporate a number 
of additional criteria into the saliency estimate. These are chosen to improve 
the robustness of the system when it is exposed to critical configurations which 
have been identified as causing instabilities. The saliency or re-weighting of each 
measurement is modified to include four additional terms. The first three of 
these terms address statistical saliency (can a feature be detected reliably?) while 
the fourth is concerned with analytical saliency (does the feature constrain the 
motion?). 

Multiple edges: When the tracker sees multiple edges within its search range, 
it is possible for the wrong one to be chosen. Typically many trackers on the 
same edge will do this, compounding the problem. To reduce this problem, 
the saliency is inversely proportional to the number of edge strength maxima 
visible within the search path. 

Many trackers disappear simultaneously: If a major edge is aligned along 
an image axis then it is possible for the entire edge to leave the field of view 
between two frames. This entails a sudden change in the set of trackers used 
and may cause a sudden apparent motion of the model. This sudden change 
in the behaviour of the tracker can be removed by constructing a border at 
the edge of the image. The saliency of nodes within this border is weakened 
linearly to zero as the pixel approaches the edge. A border of 40 pixels has 
been found to be sufficiently large for this purpose. 

Poor visibility: Generally the best measurements come from the strongest 
edges in the image, since weak edges may be difficult to locate precisely. 
This is taken into account by examining the edge strengths found in the 
search path. If the edge strength along a search path is below a threshold, 
no measurement is made for that node. Between this threshold and a higher 
threshold (equal to double the lower one), the saliency of the node is var- 
ied linearly. Above the higher threshold, the visibility does not affect the 
saliency. 

Weak conditioning: If the majority of the trackers belong to a single plane of 
the model (for example the feature rich front plane of the ship part) which 
is front on to the camera then the least squares matrix generated by these 
nodes becomes more weakly conditioned than in the general configuration. 
This can be improved by increasing the saliency of measurements that help 
to condition the least squares matrix. If the vector comprising the six image 
motions at node $\xi$ lies in the subspace spanned by the eigenvectors of $C_{ij}$ 
corresponding to the smallest eigenvalues, then that node is particularly 
important in constraining the estimated motion. This is implemented by 
the simple expedient of doubling the saliency when $(L_i^\xi \cdot \hat{n}^\xi)(L_i^\xi \cdot \hat{n}^\xi)$ is 
greater than the geometric mean of that quantity, computed over the visible 
features in the image. 

These measures have been found to provide increased robustness and while 
they represent a heuristic method of dealing with critical configurations, the 
general approach of modifying the re-weighting function, s, provides a powerful 
method of incorporating domain knowledge within the least squares framework 
in a conceptually intuitive manner. 
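The four saliency terms can be combined with the IRLS weight along the following lines. The text does not state how the terms are combined or where the visibility ramp starts, so the multiplicative combination, the ramp from 0 to 1, the form of the base weight and all names below are assumptions made for illustration.

```python
# Sketch: combined saliency of one tracking node (assumed multiplicative form).

def border_factor(x, y, width, height, border=40):
    """Linear fall-off to zero within `border` pixels of the image edge."""
    margin = min(x, y, width - 1 - x, height - 1 - y)
    return max(0.0, min(1.0, margin / border))

def visibility_factor(strength, threshold):
    """Zero below `threshold`, linear ramp up to 2 * threshold, one above."""
    if strength < threshold:
        return 0.0
    return min(1.0, (strength - threshold) / threshold)

def saliency(d, c, n_maxima, x, y, width, height, strength, threshold,
             constraint, mean_constraint):
    s = 1.0 / (c + abs(d))                       # base IRLS weight (assumed form)
    s /= max(1, n_maxima)                        # multiple edges in the search path
    s *= border_factor(x, y, width, height)      # node close to the image border
    s *= visibility_factor(strength, threshold)  # weak edge response
    if constraint > mean_constraint:             # node helps condition the fit
        s *= 2.0
    return s

example = saliency(d=0.0, c=1.0, n_maxima=2, x=20, y=240, width=640, height=480,
                   strength=15.0, threshold=10.0,
                   constraint=3.0, mean_constraint=2.0)
```

In the example the base weight of 1 is halved for two competing edge maxima, halved again for lying 20 pixels inside a 40 pixel border, halved for an edge strength midway between the two thresholds, and finally doubled for being a well-constraining node, giving 0.25.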








Fig. 7. Visual servoing system operation 



5 Visual Servoing System 

The visual servoing system (shown in Figure 7) takes the Euclidean matrix, E, as 
output from the tracking system and uses this within a non-linear control law to 
provide feedback to servo the robot to a stored target position. Target positions 
are learned by acquiring the Euclidean matrix with the robot placed in the target 
position by the supervisor. The inverse of this target matrix, $E_t^{-1}$, is easily 
computed and the product of this with the current position matrix yields the 
transformation from the target position to the current position (A). 

$$T = E E_t^{-1} \qquad (16)$$

The translation and rotation vectors that must be applied to the robot are 
then easily extracted from this representation (B) (here the indices run over 1, 2, 3): 

$$t_i = T_{i4}, \qquad r'_i = \frac{1}{2} \sum_{jk} \epsilon_{ijk} T_{jk} \qquad (17)$$

$$r = \frac{r' \sin^{-1}(|r'|)}{|r'|} \qquad (18)$$

The vectors t and r are then multiplied by a gain factor and sent to the robot 
as end effector translation and rotation velocities (C). The gain is dependent 
on the magnitudes of t and r so that small velocities are damped to obtain 
higher precision, while large errors in position may be responded to quickly. A 
maximum velocity clamp is also applied for safety reasons and to prevent possible 
instabilities due to latency. Figures 8 and 9 show the visual servoing system in 
action tracing a path between recorded waypoints and performing closed loop 
control tracking a moving part. 
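Steps (B) and (C) can be sketched as follows. The extraction follows the index expressions for the translation and rotation vectors literally; the constant gain and the clamping rule are simplifying assumptions (the text describes a magnitude-dependent gain whose form is not given), and the function names are hypothetical.

```python
# Sketch: extract translation and rotation vectors from T and form a clamped command.
import math

def servo_vectors(T):
    """Return t (last column of T) and the rotation vector r of its 3x3 block."""
    t = [T[0][3], T[1][3], T[2][3]]
    # r'_i = 1/2 * sum_jk eps_ijk T_jk, written out for i = 1, 2, 3.
    rp = [0.5 * (T[1][2] - T[2][1]),
          0.5 * (T[2][0] - T[0][2]),
          0.5 * (T[0][1] - T[1][0])]
    norm = math.sqrt(sum(x * x for x in rp))
    if norm < 1e-12:
        return t, [0.0, 0.0, 0.0]
    scale = math.asin(min(1.0, norm)) / norm   # |r'| = sin(angle), so |r| = angle
    return t, [x * scale for x in rp]

def velocity_command(v, gain, max_speed):
    """Scale an error vector by `gain`, then clamp its magnitude to `max_speed`."""
    speed = gain * math.sqrt(sum(x * x for x in v))
    if speed > max_speed:
        gain *= max_speed / speed
    return [gain * x for x in v]

# Hypothetical error transform: rotation of 0.3 rad about z plus a translation.
theta = 0.3
T = [[math.cos(theta), -math.sin(theta), 0.0, 1.0],
     [math.sin(theta), math.cos(theta), 0.0, 2.0],
     [0.0, 0.0, 1.0, 3.0],
     [0.0, 0.0, 0.0, 1.0]]
t, r = servo_vectors(T)
command = velocity_command([3.0, 4.0, 0.0], gain=2.0, max_speed=5.0)
```

For this transform the rotation vector has magnitude 0.3 rad and lies along the z axis; its sign depends on the index convention used for the antisymmetric part of T. The example command would have speed 10 before clamping and is scaled back to the limit of 5.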
6 Results 

The tracking system and visual servoing system have been tested in a number of 
experiments to assess their performance both quantitatively and qualitatively. 
These experiments were conducted with an SGI O2 workstation (225 MHz) 
controlling a Mitsubishi RV-E2 robot. 



6.1 Stability of the Tracker 

The stability of the tracker with a stationary structure was measured to assess 
the effect of image noise on the tracker. The standard deviations of position and 
rotation, as measured from the Euclidean matrix, were computed over a run of 
100 frames. From a viewing distance of 30cm, the apparent rms translational 
motion was found to be 0.03mm with the rms rotation being 0.015 degrees. 



6.2 Accuracy of Positioning 

The accuracy of positioning the robot was measured with two experiments. 
Firstly, the ship part was held fixed and the robot asked to home to a given 
position from a number of different starting points. When the robot had ceased 
to move, the program was terminated and the robot's position queried. The 
standard deviation of these positions was computed and the r.m.s. translational 
motion was 0.08mm with the r.m.s. rotation being 0.06 degrees. 

The second accuracy experiment was performed by positioning the ship part 
on an accurate turntable. The part was turned through fifteen degrees, one 
degree at a time and the robot asked to return to the target position each time. 
Again, the position of the robot was queried and a circle was fitted to the data. 
The residual error was computed and found to give an r.m.s. positional error of 
0.12mm per measurement (allowing for the three degrees of freedom absorbed 
into fitting the circle). 




Fig. 8. Closed-loop visual servoing. The task is to maintain a fixed spatial position 
relative to the workpiece. 
Fig. 9. Visual servoing. The task is to trace out a trajectory relative to the workpiece. 



6.3 Closed Loop Tracking 

The qualitative performance of the tracking and servoing systems was assessed 
in a closed loop tracking experiment. The robot was asked to maintain a fixed 
position relative to the part whilst it was moved through a series of perturbations. 
This experiment is shown in Figure 8. 

6.4 Saliency Enhancements 

The modifications to the least squares fitting procedure were tested by running 
two versions (with and without the modifications) of the tracking system con- 
currently. A series of ten experiments were conducted in which the model was 
moved through a configuration known to cause difficulties. Because two processes 
were running concurrently, the average frame rate attained by both was only 12 
frames per second. Of the ten experiments, the unmodified version lost track 
of the model on five occasions. The modified version successfully tracked on all 
ten experiments, although on two of these the tracking suffered a significant 
temporary deviation. 

6.5 Path Following 

The servoing system was made to perform path following by recording the Euclidean 
matrices at a series of waypoints along the path. The system was then made to 
home to each of these in succession, moving on to the next node as soon as the 
current one had been reached. This experiment is shown in Figure 9. 

7 Conclusion 

This paper has introduced a real-time three-dimensional tracking system based 
on an active wire frame model which is rendered with hidden line removal. A 
formulation which uses the Lie algebra of the group of rigid body transformations 
(SE(3)) to linearise the tracking problem has been presented and saliency 
measurements have been described which enhance the robustness of the tracker. A 
visual servoing system which uses the tracker has been implemented and results 
from a number of experiments presented. 
Discussion 

Kenichi Kanatani: I’ve seen this kind of model tracking problem many times 
and each uses different techniques. You project the model edge and look at gray 
levels along the perpendicular to find discontinuities. Many other authors use 
edge images and look for edge pixels. What are the advantages of gray levels 
over edge pixels? 

Tom Drummond: Edges are a very popular approach, but there’s an enormous 
cost in processing the entire image to find edges first. All of the systems I know 
that do this rely on special image processing hardware like the DataCube. With 
gray levels, we only have to process a tiny fraction of the image pixels (say 400 
tracks of 40 pixels each) and we can do this with a standard Silicon Graphics 
workstation. 

Yongduek Seo: How precisely can you control the robot? 

Tom Drummond: There are two issues: global precision and repeatability. 
When we control the robot we teach it the position by putting the camera at the 
target viewpoint. It learns the view, and this gives us a very high repeatability. 
At a distance of about 25 cm we get about 0.1mm repeatability, because the 
camera just has to move to make the image identical to the training one. But 
that’s much higher than the global precision — from a different relative viewpoint 
the relative mesh of the tracker and real structure are slightly different and the 
precision is less. 




Direct Recovery of Planar-Parallax 
from Multiple Frames 



Michal Irani¹, P. Anandan², and Meir Cohen¹ 

¹ Dept. of Computer Science and Applied Math, The Weizmann Inst. of Science, 
Rehovot, Israel, 
irani@wisdom.weizmann.ac.il 

² Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, 
anandan@microsoft.com 



Abstract. In this paper we present an algorithm that estimates dense 
planar-parallax motion from multiple uncalibrated views of a 3D scene. 

This generalizes the “plane + parallax” recovery methods to more than 
two frames. The parallax motion of pixels across multiple frames (relative 
to a planar surface) is related to the 3D scene structure and the camera 
epipoles. The parallax field, the epipoles, and the 3D scene structure 
are estimated directly from image brightness variations across multiple 
frames, without pre-computing correspondences. 

1 Introduction 

The recovery of the 3D structure of a scene and the camera epipolar-geometries 
(or camera motion) from multiple views has been a topic of considerable research. 
The large majority of the work on structure-from-motion (SFM) has assumed 
that correspondences between image features (typically a sparse set of image 
points) is given, and focused on the problem of recovering SFM based on this 
input. Another class of methods has focused on recovering dense 3D structure 
from a set of dense correspondences or an optical flow field. While these have 
the advantage of recovering dense 3D structure, they require that the correspon- 
dences are known. However, correspondence (or flow) estimation is a notoriously 
difficult problem. 

A small set of techniques have attempted to combine the correspondence 
estimation step together with SFM recovery. These methods obtain dense 
correspondences while simultaneously estimating the 3D structure and the camera 
geometries (or motion). By inter-weaving the two processes, the 
local correspondence estimation process is constrained by the current estimate of 
(global) epipolar geometry (or camera motion), and vice-versa. These techniques 
minimize the violation of the brightness gradient constraint with respect to the 
unknown structure and motion parameters. Typically this leads to a significant 
improvement in the estimated correspondences (and the attendant 3D structure) 
and some improvement in the recovered camera geometries (or motion). These 
methods are sometimes referred to as “direct methods”, since they directly 
use image brightness information to recover 3D structure and motion, without 
explicitly computing correspondences as an intermediate step. 

While these methods recover 3D information relative to a camera-centered 
coordinate system, an alternative approach has been proposed for recovering 3D 
structure in a scene-centered coordinate system. In particular, the “Plane+Parallax” 
approach analyzes the parallax displacements of points 
relative to a (real or virtual) physical planar surface in the scene (the 
“reference plane”). The underlying concept is that after the alignment of the reference 
plane, the residual image motion is due only to the translational motion of the 
camera and to the deviations of the scene structure from the planar surface. All 
effects of camera rotation or changes in camera calibration are eliminated by 
the plane stabilization. Hence, the residual image motion (the planar-parallax 
displacements) forms a radial flow field centered at the epipole. 

The “Plane+Parallax” representation has several benefits over the traditional 
camera-centered representation, which make it an attractive framework for cor- 
respondence estimation and for 3D shape recovery: 



1. Reduced search space: By parametrically aligning a visible image structure 
(which usually corresponds to a planar surface in the scene), the search 
space of unknowns is significantly reduced. Globally, all effects of unknown 
rotation and calibration parameters are folded into the homographies used 
for patch alignment. The only remaining unknown global camera parame- 
ters which need to be estimated are the epipoles (i.e., 3 global unknowns 
per frame; gauge ambiguity is reduced to a single global scale factor for all 
epipoles across all frames). Locally, because after plane alignment the un- 
known displacements are constrained to lie along radial lines emerging from 
the epipoles, local correspondence estimation reduces from a 2-D search prob- 
lem into a simpler 1-D search problem at each pixel. The 1-D search problem 
has the additional benefit that it can uniquely resolve correspondences, even 
for pixels which suffer from the aperture problem (i.e., pixels which lie on 
line structures). 

2. Provides shape relative to a plane in the scene: In many applications, 
distances from the camera are not as useful as fluctuations with 
respect to a plane in the scene. For example, in robot navigation, heights 
of scene points from the ground plane can be immediately translated into 
obstacles or holes, and can be used for obstacle avoidance, as opposed to 
distances from the camera. 

3. A compact representation: By removing the mutual global component (the 
plane homography), the residual parallax displacements are usually very 
small, and hence require significantly fewer bits to encode the shape fluc- 
tuations relative to the number of bits required to encode distances from 
the camera. This is therefore a compact representation, which also supports 
progressive encoding and a high resolution display of the data. 

4. A stratified 2D-3D representation: Work on motion analysis can be roughly 
classified into two classes of techniques: 2D algorithms which handle cases 
with no 3D parallax (e.g., estimating homographies, 2D affine transformations, 
etc), and 3D algorithms which handle cases with dense 3D parallax 
(e.g., estimating fundamental matrices, trifocal tensors, 3D shape, etc). Prior 
model selection is usually required to decide which set of algorithms to 
apply, depending on the underlying scenario. The Plane+Parallax representation 
provides a unified approach to 2D and 3D scene analysis, with 
a strategy to gracefully bridge the gap between those two extremes. 
Within the Plane+Parallax framework, the analysis always starts with 2D 
estimation (i.e., the homography estimation). When that is all the informa- 
tion available in the image sequence, that is where the analysis stops. The 
3D analysis then gradually builds on top of the 2D analysis, with the gradual 
increase in 3D information (in the form of planar-parallax displacements and 
shape-fluctuations w.r.t. the planar surface). 

Previous work used the Plane+Parallax framework to recover dense structure relative 
to the reference plane from two uncalibrated views. While their algorithm 
linearly solves for the structure directly from brightness measurements in two 
frames, it does not naturally extend to multiple frames. In this paper we show 
how dense planar-parallax displacements and relative structure can be recov- 
ered directly from brightness measurements in multiple frames. Furthermore, we 
show that many of the ambiguities existing in the two-frame case are 
resolved by extending the analysis to multiple frames. Our algorithm assumes 
as input a sequence of images in which a planar surface has been previously 
aligned with respect to a reference image (e.g., via one of the 2D parametric 
estimation techniques). We do not assume that the camera calibration 
information is known. The output of the algorithm is: (i) the epipoles for 
all the images with respect to the reference image, (ii) dense 3D structure of the 
scene relative to a planar surface, and (iii) the correspondences of all the pixels 
across all the frames, which must be consistent with (i) and (ii). The estimation 
process uses the exact equations (as opposed to instantaneous equations) 
relating the residual parallax motion of pixels across multiple frames to 
the relative 3D structure and the camera epipoles. The 3D scene structure and 
the camera epipoles are computed directly from image measurements by mini- 
mizing the variation of image brightness across the views without pre-computing 
a correspondence map. 

The current implementation of our technique relies on the prior alignment of 
the video frames with respect to a planar surface (similar to other plane+parallax 
methods). This requires that a real physical plane exists in the scene and is 
visible in all the video frames. However, this approach can be extended to arbitrary 
scenes by folding in the plane homography computation also into the simultaneous 
estimation of camera motion, scene structure, and image displacements (as 
was done previously for the case of two frames). 

The remainder of the paper describes the algorithm and shows its performance 
on real and synthetic data. Section 2 shows how the 3D structure relates 
to the 2D image displacement under the plane+parallax decomposition. 
Section 3 outlines the major steps of our algorithm. The benefits of applying the 
algorithm to multiple frames (as opposed to two frames) are discussed in 
Section 4. Section 5 shows some results of applying the algorithm to real data. 
Section 6 concludes the paper. 

2 The Plane+Parallax Decomposition 

The induced 2D image motion of a 3D scene point between two images can be 
decomposed into two components: (i) the image motion of 
a reference planar surface $\Pi$ (i.e., a homography), and (ii) the residual image 
motion, known as “planar parallax”. This decomposition is described below. 

To set the stage for the algorithm described in this paper, we begin with the 
derivation of the plane+parallax motion equations shown in [10]. Let $p = (x, y, 1)$ 
denote the image location (in homogeneous coordinates) of a point in one view 
(called the “reference view”), and let $p' = (x', y', 1)$ be its coordinates in another 
view. Let $B$ denote the homography of the plane $\Pi$ between the two views. Let 
$B^{-1}$ denote its inverse homography, and $B^{-1}_3$ be the third row of $B^{-1}$. Let 
$p_w = (x_w, y_w, 1) = \frac{B^{-1} p'}{B^{-1}_3 p'}$, namely, when the second image is warped towards 
the first image using the inverse homography $B^{-1}$, the point $p'$ will move to the 
point $p_w$ in the warped image. For 3D points on the plane $\Pi$, $p_w = p$, while for 
3D points which are not on the plane, $p_w \neq p$. It was shown in [10] that¹ 

$$p' - p = (p' - p_w) + (p_w - p)$$

and 

$$p_w - p = -\gamma\,(t_3\, p_w - t) \qquad (1)$$

where $\gamma = H/Z$ represents the 3D structure of the point $p$, where $H$ is the 
perpendicular distance (or “height”) of the point from the reference plane $\Pi$, and 
$Z$ is its depth with respect to the reference camera. All unknown calibration 
parameters are folded into the terms in the parentheses, where $t$ denotes the epipole 
in projective coordinates and $t_3$ denotes its third component: $t = (t_1, t_2, t_3)$. 

In its current form, the above expression cannot be directly used for estimating 
the unknown correspondence $p_w$ for a given pixel $p$ in the reference image, 
since $p_w$ appears on both sides of the expression. However, $p_w$ can be eliminated 
from the right hand side of the expression, to obtain the following expression: 

$$p_w - p = -\frac{\gamma}{1 + \gamma t_3}\,(t_3\, p - t). \qquad (2)$$

This last expression will be used in our direct estimation algorithm. 
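The elimination step can be written out explicitly. Starting from Eq. (1) and collecting the terms in the warped point on the left hand side (a short derivation using only the symbols defined above):

```latex
\begin{align*}
p_w - p &= -\gamma\,(t_3\,p_w - t) \\
(1 + \gamma t_3)\,p_w &= p + \gamma\,t \\
p_w - p &= \frac{p + \gamma\,t}{1 + \gamma t_3} - p
         = -\frac{\gamma}{1 + \gamma t_3}\,(t_3\,p - t)
\end{align*}
```

The last line is Eq. (2); the right hand side now depends only on the reference pixel, the shape parameter and the epipole.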

¹ The notation we use here is slightly different from the one used in [10]. The change
to projective notation is used to unify the two separate expressions provided in [10],
one for the case of a finite epipole, and the other for the case of an infinite epipole.

3 Multi-frame Parallax Estimation

Let $\{\Phi_j\}_{j=0}^{l}$ be $l + 1$ images of a rigid scene, taken using cameras with unknown
calibration parameters. Without loss of generality, we choose $\Phi_0$ as a reference
frame. (In practice, this is usually the middle frame of the sequence.) Let $\Pi$ be
a plane in the scene that is visible in all $l$ images (the “reference plane”). Using
a technique similar to HE], we estimate the image motion (homography) of $\Pi$
between the reference frame $\Phi_0$ and each of the other frames $\Phi_j$ ($j = 1, \ldots, l$).
Warping the images by those homographies yields a new sequence of
$l$ images, $\{I_j\}_{j=1}^{l}$, where the image of $\Pi$ is aligned across all frames. Also, for
the sake of notational simplicity, let us rename the reference image to be $I$, i.e.,
$I = \Phi_0$. The only residual image motion between the reference frame $I$ and the
warped images $\{I_j\}$ is the residual planar-parallax displacement $p_w^j - p$
($j = 1, \ldots, l$) due to 3D scene points that are not located on the reference plane $\Pi$.
This residual planar parallax motion is what remains to be estimated.

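Let $\mathbf{u}^j = (u^j, v^j)$ denote the first two coordinates of $p_w^j - p$ (the third
coordinate is 0). From Eq. (2) we know that the residual parallax is:

$$\mathbf{u}^j = \begin{bmatrix} u^j \\ v^j \end{bmatrix} = -\frac{\gamma}{1 + \gamma t_3^j} \begin{bmatrix} t_3^j x - t_1^j \\ t_3^j y - t_2^j \end{bmatrix} \qquad (3)$$

where the superscripts $j$ denote the parameters associated with the $j$th frame.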

In the two-frame case, one can define $\alpha = \frac{\gamma}{1 + \gamma t_3}$, and then the problem
posed in Eq. (3) becomes a bilinear problem in $\alpha$ and in $t = (t_1, t_2, t_3)^T$. This
can be solved using a standard iterative method. Once $\alpha$ and $t$ are known, $\gamma$
can be recovered. A similar approach was used in El for shape recovery from
two frames. However, this approach does not extend to multiple ($> 2$) frames,
because $\alpha$ is not a shape invariant (as it depends on $t_3$), and hence varies from
frame to frame. In contrast, $\gamma$ is a shape invariant, which is shared by all image
frames. Our multi-frame process directly recovers $\gamma$ from multi-frame brightness
quantities.
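
As a small numerical illustration of why $\alpha$ is frame dependent (a sketch only; the function names are ours): with $\alpha = \gamma / (1 + \gamma t_3)$, inverting gives $\gamma = \alpha / (1 - \alpha t_3)$, and a single shape value $\gamma$ maps to a different $\alpha$ for every value of $t_3$.

```python
def alpha_from_gamma(gamma, t3):
    """Two-frame parameter alpha = gamma / (1 + gamma * t3)."""
    return gamma / (1.0 + gamma * t3)


def gamma_from_alpha(alpha, t3):
    """Inverse of the definition above: gamma = alpha / (1 - alpha * t3)."""
    return alpha / (1.0 - alpha * t3)


# The same shape value gives a different alpha in every frame,
# because alpha depends on that frame's t3 (values here are arbitrary).
gamma = 0.2
alphas = [alpha_from_gamma(gamma, t3) for t3 in (0.5, 1.0, 2.0)]
recovered = gamma_from_alpha(alphas[1], 1.0)
```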

The basic idea behind our direct estimation algorithm is that rather than
estimating $l$ separate $\mathbf{u}^j$ vectors (corresponding to each frame) for each pixel,
we can simply estimate a single $\gamma$ (the shape parameter), which for a particular
pixel is common over all the frames, and a single $t^j = (t_1^j, t_2^j, t_3^j)$ which for each
frame $I_j$ is common to all image pixels. There are two advantages in doing this:

1. For $n$ pixels over $l$ frames we reduce the number of unknowns from $2nl$ to
$n + 3l$.

2. More importantly, the recovered flow vector is constrained to satisfy the
epipolar structure implicitly captured in Eq. (2). This can be expected to
significantly improve the quality of the recovered parallax flow vectors.
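
To give a feeling for the first advantage, the two counts can be evaluated for a concrete case. The helper name `unknown_counts` is ours, and the numbers are chosen to match the size of the synthetic sequence used later in Section 4.1 (a 105 x 105 reference image and 8 additional frames).

```python
def unknown_counts(n_pixels, n_frames):
    """Return (independent flow unknowns, shared-parameter unknowns)."""
    independent = 2 * n_pixels * n_frames   # one (u, v) per pixel per frame
    shared = n_pixels + 3 * n_frames        # one gamma per pixel, one epipole per frame
    return independent, shared


independent, shared = unknown_counts(105 * 105, 8)
```

For this case the count drops from 176400 to 11049 unknowns.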

Our direct estimation algorithm follows the same computational framework 
outlined in Q for the quasi-parametric class of models. The basic components of 
this framework are: (i) pyramid construction, (ii) iterative estimation of global 
(motion) and local (structure) parameters, and (iii) coarse-to-fine refinement. 
The overall control loop of our algorithm is therefore as follows:

1. Construct pyramids from each of the images $I_j$ and the reference frame $I$.

2. Initialize the structure parameter $\gamma$ for each pixel, and the motion parameter $t^j$
for each frame (usually we start with $\gamma = 0$ for all pixels, and $t^j = (0,0,1)^T$
for all frames).

3. Starting with the coarsest pyramid level, at each level, refine the structure
and motion using the method outlined in Section 3.1.

4. Repeat this step several times (usually about 4 or 5 times per level). 

5. Project the final value of the structure parameter to the next finer pyramid 
level. Propagate the motion parameters also to the next level. Use these as 
initial estimates for processing the next level. 

6. The final output is the structure and the motion parameters at the finest 
pyramid level (which corresponds to the resolution of the input images) and 
the residual parallax flow field synthesized from these. 
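
The steps above can be sketched as a small driver loop. This is a minimal illustration, not the implementation used in the experiments: images are plain lists of rows, the pyramid is a 2 x 2 block average rather than a Gaussian or Laplacian pyramid, `refine` is a hypothetical callback standing in for the refinement step of Section 3.1, and doubling $t_1, t_2$ when moving to a finer level is our assumption about how pixel-unit epipole coordinates are propagated.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (crude pyramid level)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]


def build_pyramid(img, levels):
    """Return the list [finest, ..., coarsest]."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr


def upsample(field, h, w):
    """Project a per-pixel field to a finer level by pixel replication."""
    ch, cw = len(field), len(field[0])
    return [[field[min(r // 2, ch - 1)][min(c // 2, cw - 1)]
             for c in range(w)] for r in range(h)]


def coarse_to_fine(ref, frames, levels, iters, refine):
    ref_pyr = build_pyramid(ref, levels)                       # step 1
    frame_pyrs = [build_pyramid(f, levels) for f in frames]
    coarsest = ref_pyr[-1]
    gamma = [[0.0] * len(coarsest[0]) for _ in coarsest]       # step 2
    epipoles = [(0.0, 0.0, 1.0) for _ in frames]
    for level in range(levels - 1, -1, -1):                    # step 3
        I = ref_pyr[level]
        Ijs = [p[level] for p in frame_pyrs]
        for _ in range(iters):                                 # step 4
            gamma, epipoles = refine(I, Ijs, gamma, epipoles)
        if level > 0:                                          # step 5
            finer = ref_pyr[level - 1]
            gamma = upsample(gamma, len(finer), len(finer[0]))
            epipoles = [(2.0 * t1, 2.0 * t2, t3) for (t1, t2, t3) in epipoles]
    return gamma, epipoles                                     # step 6
```

With `levels=3` and `iters=4`, the callback is invoked twelve times, first on the coarsest level and last on the input resolution.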

Of the various steps outlined above, the pyramid construction and the projection
of parameters are common to many techniques for motion estimation (e.g.,
see P3), hence we omit the description of these steps. On the other hand, the
refinement step is specific to our current problem. This is described next.

3.1 The Estimation Process 

The inner loop of the estimation process involves refining the current values of
the structure parameters $\gamma$ (one per pixel) and the motion parameters $t^j$ (3
parameters per frame). Let us denote the “true” (but unknown) values of these
parameters by $\gamma(x,y)$ (at location $(x,y)$ in the reference frame) and $t^j$. Let
$\mathbf{u}^j(x,y) = (u^j, v^j)$ denote the corresponding unknown true parallax flow vector.
Let $\gamma_c$, $t_c^j$, $\mathbf{u}_c^j$ denote the current estimates of these quantities. Let $\delta\gamma = \gamma - \gamma_c$,
$\delta t^j = (\delta t_1^j, \delta t_2^j, \delta t_3^j) = t^j - t_c^j$, and $\delta\mathbf{u}^j = (\delta u^j, \delta v^j) = \mathbf{u}^j - \mathbf{u}_c^j$. These $\delta$ quantities
are the refinements that are estimated during each iteration.

Assuming brightness constancy (namely, that corresponding image points
across all frames have a similar brightness value)², we have:

$$I(x,y) \approx I_j(x^j, y^j) = I_j(x + u^j, y + v^j) = I_j(x + u_c^j + \delta u^j,\; y + v_c^j + \delta v^j)$$

For small $\delta\mathbf{u}^j$ we make a further approximation:

$$I(x - \delta u^j,\; y - \delta v^j) \approx I_j(x + u_c^j,\; y + v_c^j).$$

Expanding $I$ to its first order Taylor series around $(x, y)$:

$$I(x - \delta u^j,\; y - \delta v^j) \approx I(x, y) - I_x\,\delta u^j - I_y\,\delta v^j$$

where $I_x, I_y$ denote the image intensity derivatives for the reference image (at
pixel location $(x,y)$). From here we get the brightness constraint equation:

$$I_j(x + u_c^j,\; y + v_c^j) \approx I(x, y) - I_x\,\delta u^j - I_y\,\delta v^j$$

Or:

$$I_j(x + u_c^j,\; y + v_c^j) - I(x, y) + I_x\,\delta u^j + I_y\,\delta v^j \approx 0$$

Substituting $\delta\mathbf{u}^j = \mathbf{u}^j - \mathbf{u}_c^j$ yields:

$$I_j(x + u_c^j,\; y + v_c^j) - I(x,y) + I_x (u^j - u_c^j) + I_y (v^j - v_c^j) \approx 0$$

Or, more compactly:

$$\tau^j(x, y) + I_x u^j + I_y v^j \approx 0 \qquad (4)$$

where

$$\tau^j(x, y) \stackrel{\mathrm{def}}{=} I_j(x + u_c^j,\; y + v_c^j) - I(x, y) - I_x u_c^j - I_y v_c^j$$

² Note that over multiple frames the brightness will change somewhat, at least due to
global illumination variation. We can handle this by using the Laplacian pyramid
(as opposed to the Gaussian pyramid), or otherwise pre-filtering the images (e.g.,
normalizing to remove global mean and contrast changes), and applying the brightness
constraint to the filtered images.
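
To make the definition of $\tau^j$ concrete, the following sketch evaluates it on a toy example. It rests on simplifying assumptions that are ours: images are plain Python lists indexed as `img[y][x]`, the derivatives $I_x, I_y$ are central differences, and the current flow estimate $(u_c^j, v_c^j)$ is an integer so that no interpolation is needed. The function name `tau_and_gradients` is hypothetical.

```python
def tau_and_gradients(I, Ij, x, y, uc, vc):
    """Return (tau, Ix, Iy) at pixel (x, y) for an integer flow estimate (uc, vc).

    tau = I_j(x + uc, y + vc) - I(x, y) - Ix * uc - Iy * vc
    """
    Ix = (I[y][x + 1] - I[y][x - 1]) / 2.0  # central difference along x
    Iy = (I[y + 1][x] - I[y - 1][x]) / 2.0  # central difference along y
    tau = Ij[y + vc][x + uc] - I[y][x] - Ix * uc - Iy * vc
    return tau, Ix, Iy


# Toy data: a linear intensity ramp, and the same ramp displaced by (2, 1),
# so that I(x, y) = I_j(x + 2, y + 1) holds exactly.
size = 12
I = [[3.0 * x + 2.0 * y + 1.0 for x in range(size)] for y in range(size)]
Ij = [[3.0 * (x - 2) + 2.0 * (y - 1) + 1.0 for x in range(size)] for y in range(size)]

tau, Ix, Iy = tau_and_gradients(I, Ij, x=5, y=5, uc=1, vc=0)
residual = tau + Ix * 2 + Iy * 1  # left-hand side of Eq. (4) at the true flow (2, 1)
```

For a linear ramp the first-order expansion is exact, so `residual` vanishes; on real images Eq. (4) holds only approximately.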

If we now substitute the expression for the local parallax flow vector $\mathbf{u}^j$ given
in Eq. (3), we obtain the following equation that relates the structure and motion
parameters directly to image brightness information:

$$\tau^j(x,y) - \frac{\gamma(x,y)}{1 + \gamma(x,y)\, t_3^j} \left( I_x (t_3^j x - t_1^j) + I_y (t_3^j y - t_2^j) \right) \approx 0 \qquad (5)$$

We refer to the above equation as the “epipolar brightness constraint”.

Each pixel and each frame contributes one such equation, where the unknowns
are: the relative scene structure $\gamma = \gamma(x,y)$ for each pixel $(x,y)$, and
the epipoles $t^j$ for each frame ($j = 1, 2, \ldots, l$). Those unknowns are computed in
two phases. In the first phase, the “Local Phase”, the relative scene structure, $\gamma$,
is estimated separately for each pixel via least squares minimization over multiple
frames simultaneously. This is followed by the “Global Phase”, where all
the epipoles $t^j$ are estimated between the reference frame and each of the other
frames, using least squares minimization over all pixels. These two phases are
described in more detail below.



Local Phase In the local phase we assume all the epipoles are given (e.g.,
from the previous iteration), and we estimate the unknown scene structure $\gamma$
from all the images. $\gamma$ is a local quantity, but is common to all the images
at a point. When the epipoles are known, each frame $I_j$ provides one
constraint of Eq. (5) on $\gamma$. Therefore, theoretically,
there is sufficient geometric information for solving for $\gamma$. However, for increased
numerical stability, we locally assume each $\gamma$ is constant over a small window
around each pixel in the reference frame. In our experiments we used a $5 \times 5$
window. For each pixel $(x,y)$, we use the error function:



$$\mathrm{Err}(\gamma) \stackrel{\mathrm{def}}{=} \sum_{j} \sum_{(\tilde{x},\tilde{y}) \in \mathrm{Win}(x,y)} \left( \tilde{\tau}^j (1 + \gamma t_3^j) - \gamma \left( \tilde{I}_x (t_3^j \tilde{x} - t_1^j) + \tilde{I}_y (t_3^j \tilde{y} - t_2^j) \right) \right)^2 \qquad (6)$$

where $\gamma = \gamma(x,y)$, $\tilde{\tau}^j = \tau^j(\tilde{x},\tilde{y})$, $\tilde{I}_x = I_x(\tilde{x},\tilde{y})$, $\tilde{I}_y = I_y(\tilde{x},\tilde{y})$, and $\mathrm{Win}(x,y)$ is a
$5 \times 5$ window around $(x, y)$. Differentiating $\mathrm{Err}(\gamma)$ with respect to $\gamma$ and equating
it to zero yields a single linear equation that can be solved to estimate $\gamma(x,y)$.
The error term $\mathrm{Err}(\gamma)$ was obtained by multiplying Eq. (5) by the denominator
$(1 + \gamma t_3^j)$ to yield a linear expression in $\gamma$. Note that without multiplying by the
denominator, the local estimation process (after differentiation) would require
solving a polynomial equation in $\gamma$ whose order increases with $l$ (the number of
frames). Minimizing $\mathrm{Err}(\gamma)$ is in practice equivalent to applying weighted least
squares minimization on the collection of original Eqs. (5), with weights equal
to the denominators. We could apply normalization weights $\frac{1}{1 + \gamma_c t_3^j}$ (where $\gamma_c$
is the estimate of the shape at pixel $(x, y)$ from the previous iteration) to the
linearized expression, in order to assure minimization of meaningful quantities
(as is done in PSI), but in practice, for the examples we used, we found it was
not necessary to do so during the local phase. However, such a normalization
weight was important during the global phase (see below).
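
Because each residual is linear in $\gamma$, the local phase has a closed-form solution. The sketch below is an illustration under our own assumptions, not the authors' code: it uses the sign convention that follows from Eq. (2), so each residual is $\tau(1 + \gamma t_3) - \gamma g = \tau + \gamma c$ with $g = I_x(t_3 x - t_1) + I_y(t_3 y - t_2)$ and $c = \tau t_3 - g$, and setting the derivative of $\sum (\tau + \gamma c)^2$ to zero gives $\gamma = -\sum \tau c / \sum c^2$. The function name and the data layout are hypothetical.

```python
def solve_gamma(samples, epipoles):
    """Closed-form minimizer of the local-phase error for one pixel.

    samples[j] is a list of (tau, Ix, Iy, x, y) tuples taken from the window
    around the pixel in frame j; epipoles[j] = (t1, t2, t3) for that frame.
    Returns None when the data place no constraint on gamma.
    """
    num = 0.0
    den = 0.0
    for window, (t1, t2, t3) in zip(samples, epipoles):
        for tau, Ix, Iy, x, y in window:
            g = Ix * (t3 * x - t1) + Iy * (t3 * y - t2)
            c = tau * t3 - g          # residual is tau + gamma * c
            num += tau * c
            den += c * c
    if den == 0.0:
        return None
    return -num / den


# Synthetic check: generate tau from a known gamma via Eq. (5), then recover it.
gamma_true = 0.3
epipoles = [(10.0, 20.0, 1.0), (-5.0, 40.0, 0.5)]
window_pixels = [(1.0, 0.5, 30.0, 31.0), (-0.4, 2.0, 31.0, 31.0), (0.7, -1.2, 30.0, 32.0)]
samples = []
for (t1, t2, t3) in epipoles:
    window = []
    for (Ix, Iy, x, y) in window_pixels:
        g = Ix * (t3 * x - t1) + Iy * (t3 * y - t2)
        window.append((gamma_true * g / (1.0 + gamma_true * t3), Ix, Iy, x, y))
    samples.append(window)

gamma_est = solve_gamma(samples, epipoles)
```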



Global Phase In the global phase we assume the structure $\gamma$ is given (e.g.,
from the previous iteration), and we estimate for each image $I_j$ the position of its
epipole $t^j$ with respect to the reference frame. We estimate the set of epipoles
$\{t^j\}$ by minimizing the following error with respect to each of the epipoles $t^j$:



$$\mathrm{Err}(t^j) \stackrel{\mathrm{def}}{=} \sum_{(x,y)} \left( W_j(x,y) \left[ \tau^j (1 + \gamma t_3^j) - \gamma \left( I_x (t_3^j x - t_1^j) + I_y (t_3^j y - t_2^j) \right) \right] \right)^2 \qquad (7)$$

where $I_x = I_x(x,y)$, $I_y = I_y(x,y)$, $\tau^j = \tau^j(x,y)$, $\gamma = \gamma(x,y)$. Note that, when
$\gamma(x, y)$ are fixed, this minimization problem decouples into a set of separate
individual minimization problems, each a function of one epipole $t^j$ for the $j$th
frame. The inside portion of this error term is similar to the one we used above
for the local phase, with the addition of a scalar weight $W_j(x,y)$. The scalar
weight is used to serve two purposes. First, if Eq. (7) did not contain the weights
$W_j(x,y)$, it would be equivalent to a weighted least squares minimization of
Eq. (5), with weights equal to the denominators $(1 + \gamma(x,y)\, t_3^j)$. While this
provides a convenient linear expression in the unknown $t^j$, these weights are not
physically meaningful, and tend to skew the estimate of the recovered epipole.
Therefore, in a fashion similar to CHI, we choose the weights $W_j(x,y)$ to be
$(1 + \gamma(x, y)\, t_3^j)^{-1}$, where the $\gamma$ is the updated estimate from the local phase,
whereas the $t_3^j$ is based on the current estimate of $t^j$ (from the previous
iteration).
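
For a fixed structure map the weighted residual is linear in the three components of $t^j$, so each epipole can be obtained from a 3 x 3 system of normal equations. The sketch below illustrates this under our own assumptions: it uses the sign convention that follows from Eq. (2), writes the residual as $\tau + a \cdot t$ with $a = (\gamma I_x,\; \gamma I_y,\; \gamma(\tau - I_x x - I_y y))$, and solves the system with a small hand-written elimination routine. All names are hypothetical.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    t = [0.0] * n
    for r in range(n - 1, -1, -1):
        t[r] = (M[r][n] - sum(M[r][c] * t[c] for c in range(r + 1, n))) / M[r][r]
    return tuple(t)


def solve_epipole(samples, t3_current):
    """Weighted least-squares estimate of one epipole (t1, t2, t3).

    samples is a list of (tau, Ix, Iy, x, y, gamma) tuples over the pixels of
    one frame; the weights 1 / (1 + gamma * t3_current) use the previous t3.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for tau, Ix, Iy, x, y, gamma in samples:
        w2 = (1.0 / (1.0 + gamma * t3_current)) ** 2
        a = (gamma * Ix, gamma * Iy, gamma * (tau - Ix * x - Iy * y))
        for r in range(3):
            b[r] -= w2 * a[r] * tau
            for c in range(3):
                A[r][c] += w2 * a[r] * a[c]
    return solve3(A, b)


# Synthetic check: generate tau from a known epipole via Eq. (5), then recover it.
t_true = (10.0, 20.0, 1.0)
pixels = [(1.0, 0.5, 30.0, 31.0, 0.3), (-0.4, 2.0, 5.0, 60.0, 0.1),
          (0.7, -1.2, 80.0, 12.0, 0.5), (2.0, 0.3, 44.0, 90.0, 0.2),
          (-1.5, -0.8, 70.0, 70.0, 0.4), (0.2, 1.1, 15.0, 25.0, 0.25)]
samples = []
for (Ix, Iy, x, y, gamma) in pixels:
    g = Ix * (t_true[2] * x - t_true[0]) + Iy * (t_true[2] * y - t_true[1])
    samples.append((gamma * g / (1.0 + gamma * t_true[2]), Ix, Iy, x, y, gamma))

t_est = solve_epipole(samples, t3_current=1.0)
```

Pixels with $\gamma = 0$ (points on the reference plane) contribute nothing to the system, so the estimate relies on pixels with non-zero parallax; a degenerate configuration shows up here as a division by zero in `solve3`.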

The scalar weight also provides an easy way to introduce additional robustness
to the estimation process, in order to reduce the contribution of pixels
that are potentially outliers. For example, we can use weights based on residual
misalignment of the kind used in |S|.

4 Multi-frame vs. Two-Frame Estimation

The algorithm described in Section 3 extends the plane+parallax estimation
to multiple frames. The most obvious benefit of multi-frame processing is the
improved signal-to-noise performance that is obtained due to having a larger
set of independent samples. However, there are two additional benefits to
multi-frame estimation: (i) overcoming the aperture problem, from which the two-frame
estimation often suffers, and (ii) resolving the singularity of shape recovery in
the vicinity of the epipole (we refer to this as the epipole singularity).

4.1 Eliminating the Aperture Problem 

When only two images are used as in nmsi, there exists only one epipole. The 
residual parallax lies along epipolar lines (centered at the epipole, see Eq. (3)).
The epipolar field provides one line constraint on each parallax displacement,
and the Brightness Constancy constraint forms another line constraint (Eq. (4)).
When those lines are not parallel, their intersection uniquely defines the parallax
displacement. However, if the image gradient at an image point is perpendicular to
the epipolar line passing through that point (i.e., the local image edge is parallel
to that line), then the two constraint lines are parallel and the parallax displacement
(and hence the structure) cannot be uniquely determined. When multiple
images with multiple epipoles are used, this ambiguity is resolved, because
the image gradient at a point can be perpendicular to at most one of the epipolar lines
associated with it. This observation was also made by f4ll5j . 
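
The degeneracy can be seen directly in the coefficient that multiplies the structure term in the epipolar brightness constraint, $I_x(t_3^j x - t_1^j) + I_y(t_3^j y - t_2^j)$: when it vanishes for a frame, that frame carries no information about the structure at that pixel. The sketch below (function names are ours) evaluates it for epipoles at infinity, written here as $t = (1,0,0)$ for horizontal motion and $t = (0,1,0)$ for vertical motion, which is an assumed normalization.

```python
def constraint_strength(grad, pixel, epipole):
    """Coefficient of the structure term: Ix*(t3*x - t1) + Iy*(t3*y - t2)."""
    (Ix, Iy), (x, y), (t1, t2, t3) = grad, pixel, epipole
    return Ix * (t3 * x - t1) + Iy * (t3 * y - t2)


def is_recoverable(grad, pixel, epipoles, eps=1e-12):
    """True if at least one frame constrains the structure at this pixel."""
    return any(abs(constraint_strength(grad, pixel, t)) > eps for t in epipoles)


vertical_bars_gradient = (1.0, 0.0)   # vertical bars have a horizontal gradient
pixel = (40.0, 40.0)
horizontal_motion = (1.0, 0.0, 0.0)   # epipole at infinity along x
vertical_motion = (0.0, 1.0, 0.0)     # epipole at infinity along y

only_vertical = is_recoverable(vertical_bars_gradient, pixel, [vertical_motion])
mixed = is_recoverable(vertical_bars_gradient, pixel,
                       [vertical_motion, horizontal_motion])
```

With vertical motion alone the square with vertical bars is not constrained (`only_vertical` is `False`); adding a frame with horizontal motion removes the ambiguity.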

To demonstrate this, we used a sequence composed of 9 images ($105 \times 105$
pixels) of 4 squares ($30 \times 30$ pixels) moving over a stationary textured background
(which plays the role of the aligned reference plane). The 4 squares have the same
motion: first they were all shifted to the right (one pixel per frame) to generate
the first 5 images, and then they were all shifted down (one pixel per frame) to
generate the next 4 images. The width of the stripes on the squares is 5 pixels.
A sample frame is shown in Fig. 1a (the fifth frame).

The epipoles that correspond to this motion are at infinity: the horizontal
motion has an epipole at $(\infty, 52.5)$, and the vertical motion has an epipole at
$(52.5, \infty)$. The texture on the squares was selected so that the spatial gradients of
one square are parallel to the direction of the horizontal motion, another square 
has spatial gradients parallel to the direction of the vertical motion, and the two 
other squares have spatial gradients in multiple directions. We have tested the 
algorithm on three cases: (i) pure vertical motion, (ii) pure horizontal motion, 
and (iii) mixed motions. 

Fig. 1b is a typical depth map that results from applying the algorithm to
sequences with purely vertical motion. (Dark grey corresponds to the reference
plane, and light grey corresponds to elevated scene parts, i.e., the squares.) The
structure for the square with vertical bars is, as expected, not estimated well,
because the epipolar constraints are parallel to those bars. This is true even
when the algorithm is applied to multiple frames with the same epipole.

Fig. 1c is a typical depth map that results from applying the algorithm to
sequences with purely horizontal motion. Note that the structure for the square
with horizontal bars is not estimated well.

Fig. 1d is a typical depth map that results from applying the algorithm to
multiple images with mixed motions (i.e., more than one distinct epipole). Note
that now the shape recovery does not suffer from the aperture problem.

Fig. 1. Resolving aperture problem. (a) A sample image. (b) Shape recovery for pure
vertical motion. Ambiguity along vertical bars. (c) Shape recovery for pure horizontal
motion. Ambiguity along horizontal bars. (d) Shape recovery for a sequence with mixed
motions. No ambiguity.



4.2 Epipole Singularity 

From the planar parallax Eq. (3), it is clear that the structure $\gamma$ cannot be
determined at the epipole, because at the epipole: $t_3^j x - t_1^j = 0$ and
$t_3^j y - t_2^j = 0$. For the same reason, the recovered structure in the vicinity of the epipole
is highly sensitive to noise and unreliable. However, when there are multiple
epipoles, this ambiguity disappears. The singularity at one epipole is resolved
by information from another epipole.
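
A small numerical illustration of the singularity (a sketch; the function name and the epipole values are ours): evaluating the parallax expression of Eq. (3), with the sign convention of Eq. (2), at a pixel that coincides with one epipole gives zero displacement for every value of $\gamma$, while a second frame with a different epipole still distinguishes the values.

```python
def parallax(gamma, epipole, x, y):
    """Residual parallax (u, v) for structure gamma at pixel (x, y)."""
    t1, t2, t3 = epipole
    s = -gamma / (1.0 + gamma * t3)
    return (s * (t3 * x - t1), s * (t3 * y - t2))


t_a = (50.0, 50.0, 1.0)   # epipole of frame A, located at pixel (50, 50)
t_b = (10.0, 80.0, 1.0)   # epipole of frame B, located elsewhere

at_epipole = [parallax(g, t_a, 50.0, 50.0) for g in (0.1, 0.9)]
other_frame = [parallax(g, t_b, 50.0, 50.0) for g in (0.1, 0.9)]
```

Frame A yields the same (zero) displacement for both structure values, so it cannot tell them apart; frame B yields two clearly different displacements.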

To test this behavior, we compared the results for the case with only one
epipole (i.e., two frames) to cases with multiple epipoles at different locations.
Results are shown in Fig. 2. The sequence that we used was composed of images
of a square that is elevated from a reference plane, and the simulated motion
(after plane alignment) was a looming motion (i.e., forward motion). Figs. 2a,b,c
show three sample images from the sequence. Fig. 2d shows the singularity around
the epipole in the two-frame case. Figs. 2e,h,i,j show that the singularity at
the epipoles is eliminated when there is more than one epipole. Using more
images also increases the signal-to-noise ratio and further improves the shape
reconstruction.



Fig. 2. Resolving epipole singularity in case of multiple epipoles. (a-c) Sample images
from a 9-frame sequence with multiple epipoles. (d,f) Shape recovery using 2 images
(epipole singularity exists in this case). (e,g) Using 3 images with 2 different epipoles.
(h,k) Using 5 images with multiple epipoles. (i,l) Using 7 images with multiple epipoles.
(j,m) Using 9 images with multiple epipoles. Note that the epipole singularity disappears
once multiple epipoles exist. (f,g,k,l,m) show an enlarged view of the depth image in the
vicinity of the epipoles. The box shows the region where the epipoles are. For visibility
purposes, different images are shown at different scales. For reference, coordinate rulers
are attached to each image.

5 Real World Examples

This section provides experimental results of applying our algorithm to real world
sequences. Fig. 3 shows an example of shape recovery from an indoor sequence
(the “block” sequence from ). The reference plane is the carpet. Fig. 3a
shows one frame from the sequence. Fig. 3b shows the recovered structure.
Brighter grey levels correspond to taller points relative to the carpet. Note the
fine structure of the toys on the carpet.

Fig. 3. Blocks sequence. (a) One frame from the sequence. (b) The recovered shape
(relative to the carpet). Brighter values correspond to taller points.

Fig. 4 shows an example of shape recovery for a sequence of five frames (part
of the flower garden sequence). The reference plane is the house. Fig. 4a shows
the reference frame from the sequence. Fig. 4b shows the recovered structure.
Note the gradual change of depth in the field.

Fig. 4. Flower-garden sequence. (a) One frame from the sequence. (b) The recovered
shape (relative to the facade of the house). Brighter values correspond to points farther
from the house.

Fig. 5 shows an example of shape recovery for a sequence of 5 frames. The
reference plane is the flat region in front of the building. Fig. 5a shows one
frame from the sequence. Fig. 5b shows the recovered structure. The brightness
reflects the magnitude of the structure parameter $\gamma$ (brighter values correspond
to scene points above the reference plane and darker values correspond to scene
points below the reference plane). Note the fine structure of the stairs and the
lamp-pole. The shape of the building wall is not fully recovered because of lack
of texture in that region.

Fig. 5. Stairs sequence. (a) One frame from the sequence. (b) The recovered shape
(relative to the ground surface just in front of the building). Brighter values correspond
to points above the ground surface, while darker values correspond to points below the
ground surface.

6 Conclusion 

We presented an algorithm for estimating dense planar-parallax displacements
from multiple uncalibrated views. The image displacements, the 3D structure,
and the camera epipoles are estimated directly from image brightness variations
across multiple frames. This algorithm extends the two-frame plane+parallax
estimation algorithm of II 111 31 to multiple frames. The current algorithm
relies on prior plane alignment. A natural extension of this algorithm would be
to fold the homography estimation into the simultaneous estimation of image
displacements, scene structure, and camera motion (as was done by m for two
frames).

Discussion

Rick Szeliski: How do you initialize? - You have a sort of back and forth, two 
phase method for solving a bilinear problem. But when you have your first plane 
stabilized in a set of images, how do you guess the initial epipole or depth-map? 
Michal Irani: We start with a zero depth map, and for the epipoles we try five 
different positions, one in the centre and one in each of the four quadrants. This 
does provide a good enough initialization — even in the case where the epipoles 
are at infinity we converge to the correct solution. 

Bill Triggs: Just a comment on a comment you made about linearized methods 
being stabler than fully nonlinear ones. Assuming that the nonlinear method 
optimizes the statistically correct error model, its stability is by definition the 
true stability of the problem. If a linear method appears stabler it must be 
because it’s either estimating a simplified model, or biased. So linear methods 
are not intrinsically stabler, they’re just more often allowed to give wrong results. 

Michal Irani: Well, linear algorithms are much simpler, and when their
approximations are valid, they don’t give the wrong results. So that’s exactly the
question: when are they valid? Because when they are, you would like to use
them. The case I was talking about, the intermediate approximation where
the global component of the homography is exact and only the local component
is approximated, turns out to be valid in many, many cases. That’s what
we’re checking right now. It has the potential to produce very simple algorithms
without making any severe assumptions, whereas the original Longuet-Higgins
approximation was very restrictive.

P. Anandan: I want to make a comment on Bill’s comment. I think you see a 
similar thing about nonlinear methods being unstable in the work on encoding 
epipolar geometry. I recall Adiv’s work, where at each iteration you normalize 
by the current depth to make the flow look more like the exact equation. So 
in some sense, by using weights based on the current estimate, you reduce the 
error introduced by the linear approximation during the iterative process. I’m 
not sure whether linear methods with varying weights should count as linear 
or nonlinear for stability. The same issues come up in correspondence based 
methods for structure-from-motion as well.

Michal Irani: I’m not sure whether this is what you meant, Anandan, but
generally when you solve a nonlinear iteration step you make some approximations
that may not be correct. It’s better to start with valid approximations than to
start with bad ones. So, when you’re assuming linear models, at least you know
which approximations you’re making.

Abstract. Image-based reconstruction from randomly scattered views is a
challenging problem. We present a new algorithm that extends Seitz and Dyer’s 
Voxel Coloring algorithm. Unlike their algorithm, ours can use images from 
arbitrary camera locations. The key problem in this class of algorithms is that of 
identifying the images from which a voxel is visible. Unlike Kutulakos and 
Seitz’s Space Carving technique, our algorithm solves this problem exactly and 
the resulting reconstructions yield better results in our application, which is 
synthesizing new views. One variation of our algorithm minimizes color 
consistency comparisons; another uses less memory and can be accelerated with 
graphics hardware. We present efficiency measurements and, for comparison, 
we present images synthesized using our algorithm and Space Carving. 



1 Introduction 

We present a new algorithm for volumetric scene reconstruction. Specifically, given 
a handful of images of a scene taken from arbitrary but known locations, the 
algorithm builds a 3D model of the scene that is consistent with the input images. We 
call the algorithm Generalized Voxel Coloring or GVC. Like two earlier solutions to 
the same problem, Voxel Coloring [1] and Space Carving [2], our algorithm uses 
voxels to model the scene and exploits the fact that surface points in a scene, and 
voxels that represent them, project to consistent colors in the input images. Although 
Voxel Coloring and Space Carving are particularly successful solutions to the scene 
reconstruction problem, our algorithm has advantages over each of them. Unlike 
Voxel Coloring, GVC allows input cameras to be placed at arbitrary locations in and 
around the scene. This is why we call it Generalized Voxel Coloring. When checking 
the color consistency of a voxel, GVC uses the entire set of images from which the 
voxel is visible. Space Carving usually uses only a subset of those images. Using full 
visibility during reconstruction yields better results in our application, which is 
synthesizing new views. 

New-view synthesis, a problem in image-based rendering, aims to produce new 
views of a scene, given a handful of existing images. Conventional computer graphics 
typically creates new views by projecting a manually generated 3D model of the 
scene. Image-based rendering has attracted considerable interest recently because 
images are so much easier to acquire than 3D models. We solve the new-view
synthesis problem by first using GVC to create a model of the scene and then
projecting the model to the desired viewpoint.

We first describe earlier solutions to the new-view synthesis and volumetric 
reconstruction problems. We compare the merits of those solutions to our own. Next, 
we discuss GVC in detail. We have implemented two versions of GVC. One uses 
layered depth images (LDIs) to eliminate unnecessary color consistency comparisons. 
The other version makes more efficient use of memory. Next, we present our 
experimental results. We compare our two implementations with Space Carving in 
terms of computational efficiency and quality of synthesized images. Finally, we 
describe our future work. 



2 Related Work 

View Morphing [3] and Light Fields [4] are solutions to the new-view synthesis 
problem that do not create a 3D model as an intermediate step. View Morphing is one 
of the simplest solutions to the problem. Given two images of a scene, it uses 
interpolation to create a new image intermediate in viewpoint between the input 
images. Because View Morphing uses no 3D information about the scene, it cannot in 
general render images that are strictly correct, although the results often look 
convincing. Most obviously, the algorithm has limited means to correctly render 
objects in the scene that occlude one another. 

Lumigraph [5] and Light Field techniques use a sampling of the light radiated in 
every direction from every point on the surface of a volume. In theory, such a 
collection of data can produce nearly perfect new views. In practice, however, the 
amount of input data required to synthesize high quality images is far greater than 
what we use with GVC and is impractical to capture and store. Concentric Mosaics 
[6] is a similar technique that makes the sampling and storage requirements more 
practical by restricting the range over which new views may be synthesized. These 
methods have an advantage over nearly all competing approaches: they treat view- 
dependent effects, like refraction and specular reflections, correctly. 

Stereo techniques [7, 8] find points in two or more input images that correspond to 
the same point in the scene. They then use knowledge of the camera locations and 
triangulation to determine the depth of the scene point. Unfortunately, stereo is 
difficult to apply to images taken from arbitrary viewpoints. If the input viewpoints 
are far apart, then corresponding image points are hard to find automatically. On the 
other hand, if the viewpoints are close together, then small measurement errors result 
in large errors in the calculated depths. Furthermore, stereo naturally produces a 2D 
depth map and integrating many such maps into a true 3D model is a challenging 
problem [9]. 

Roy and Cox [10] and Szeliski and Golland [11] have developed variations of 
stereo that, in one respect, resemble voxel coloring: they project discrete 3D grid 
points into an arbitrary number of images to collect correlation or color variance 
statistics. Roy and Cox impose a smoothness constraint both along and across 
epipolar lines, which produces better reconstructions compared with conventional 
stereo. A major shortcoming of their algorithm is that it does not model occlusion. 

Szeliski and Golland’s algorithm models occlusion but not as generally as GVC.
Specifically, their scheme for finding an initial set of occluding gridpoints is unlikely
to work when the cameras surround the scene. However, the authors are ambitious in
recovering fractional opacity and correct color for voxels whose projections in the
images span occlusion boundaries, a goal we have not attempted.

Faugeras and Keriven [12] have produced impressive reconstructions by applying
variational methods within a level set formulation. Surfaces, initially larger than the
scenes, are refined using PDEs to successively better approximations of the scenes.
Like GVC, their method can employ arbitrary numbers of images, account for
occlusion correctly, and deduce arbitrary topologies. It is not clear under what
conditions their method converges or whether it can be easily extended to use color
images. Neither Szeliski and Golland nor Faugeras and Keriven have provided
runtime and memory statistics so it is not clear if their methods are practical for
reconstructing large scenes.

Voxel Coloring, Space Carving, and GVC all exploit the fact that points on 
Lambertian surfaces are color-consistent — they project onto similar colors in all the 
images from which they are visible. These methods start with an arbitrary number of 
calibrated* images of the scene and a set of voxels that is a superset of the scene. Each 
voxel is projected into the images from which it is visible. If the voxel projects onto 
inconsistent colors in several images, it must not be on a surface and, so, it is
carved, that is, declared to be transparent. Otherwise, the voxel is colored, i.e.,
declared to be opaque and assigned the color of its projections. These algorithms stop 
when all the opaque voxels project into consistent colors in the images. Because the 
final set of opaque voxels is color-consistent, it is a good model of the scene. 
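
The carve-or-color decision described above can be sketched as follows. This is only an illustration: the function name `classify_voxel` and the spread-based consistency test (the largest per-channel difference between the projected colors) are assumptions made for the example, not the measure used by Voxel Coloring, Space Carving, or GVC.

```python
def classify_voxel(projected_colors, threshold):
    """Return ('carved', None) or ('colored', mean_color) for one voxel.

    projected_colors holds one (r, g, b) tuple per image in which the voxel
    is currently visible.
    """
    channels = list(zip(*projected_colors))
    spread = max(max(ch) - min(ch) for ch in channels)
    if spread > threshold:
        return "carved", None                     # inconsistent: make transparent
    n = len(projected_colors)
    mean = tuple(sum(ch) / n for ch in channels)  # consistent: assign a color
    return "colored", mean


consistent = classify_voxel([(200, 10, 10), (198, 12, 9), (202, 11, 11)], 10)
inconsistent = classify_voxel([(200, 10, 10), (20, 200, 30)], 10)
```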

Voxel Coloring, Space Carving and GVC all differ in the way they determine 
visibility, the knowledge of which voxels are visible from which pixels in the images. 
A voxel fails to be visible from an image if it projects outside the image or it is 
blocked by other voxels that are currently considered to be opaque. When the opacity 
of a voxel changes, the visibility of other voxels potentially changes, so an efficient 
means is needed to update the visibility. 

Voxel Coloring puts constraints on the camera locations to simplify the visibility 
computation. It requires the cameras be placed in such a way that the voxels can be 
visited, on a single scan, in front-to-back order relative to every camera. Typically, 
this condition is met by placing all the cameras on one side of the scene and scanning 
voxels in planes that are successively further from the cameras. Thus, the 
transparency of all voxels that might occlude a given voxel is determined before the 
given voxel is checked for color consistency. Although it simplifies the visibility 
computation, the restriction on camera locations is a significant limitation. For 
example, the cameras cannot surround the scene, so some surfaces will not be visible 
in any image and hence cannot be reconstructed. 

Space Carving and GVC remove Voxel Coloring’s restriction on camera locations. 
These are among the few reconstruction algorithms for which arbitrarily and widely 
dispersed image viewpoints are not a hindrance. With the cameras placed arbitrarily, 
no single scan of the voxels, regardless of its order, will enable each voxel’s visibility 
in the final model (and hence its color consistency) to be computed correctly. 



* We define an image to be calibrated if, given any point in the 3D scene, we know where it 
projects in the image. However, Saito and Kanade [13] have shown a weaker form of 
calibration can also be used. 
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Several key insights of Kutulakos and Seitz enable algorithms to be designed that 
evaluate the consistency of voxels multiple times during carving, using changing and 
incomplete visibility information, and yet yield a color-consistent reconstruction at 
the end. Space Carving and GVC initially consider all voxels to be opaque, i.e. 
uncarved, and only change opaque voxels to transparent, never the reverse. 
Consequently, as some voxels are carved, the remaining uncarved voxels can only 
become more visible from the images. In particular, if S is the set of pixels that have 
an unoccluded view of an uncarved voxel at one point in time and if S* is the set of 
such pixels at a later point in time, then S ⊆ S*. Kutulakos and Seitz assume a color 
consistency function will be used that is monotonic, meaning for any two sets of 
pixels S and S* with S ⊆ S*, if S is inconsistent, then S* is inconsistent also. This 
seems intuitively reasonable since a set of pixels with dissimilar color will continue to 
be dissimilar if more pixels are added to the set. Given that the visibility of a voxel 
only increases as the algorithm runs and the consistency function is monotonic, it 
follows that carving is conservative: no voxel will ever be carved if it would be 
color-consistent in the final model. 
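
The claim that carving can only enlarge visibility sets can be illustrated with a minimal sketch. It assumes a toy world that is not part of the original algorithms: a single row of voxels watched by two hypothetical one-pixel cameras named "left" and "right", one at each end of the row.

```python
def visible_pixels(opaque, voxel):
    """Return the set of camera names with an unoccluded view of `voxel`.

    `opaque` is a list of booleans for a 1D row of voxels. The hypothetical
    camera "left" looks along the row from index 0, "right" from the far end.
    """
    if not opaque[voxel]:
        return set()                    # a carved voxel is not part of the model
    seen = set()
    if not any(opaque[:voxel]):         # nothing opaque between left camera and voxel
        seen.add("left")
    if not any(opaque[voxel + 1:]):     # nothing opaque between voxel and right camera
        seen.add("right")
    return seen


opaque = [True, True, True]
before = visible_pixels(opaque, 1)      # middle voxel is hidden from both sides
opaque[0] = False                       # carve the left voxel
after_one = visible_pixels(opaque, 1)
opaque[2] = False                       # carve the right voxel
after_two = visible_pixels(opaque, 1)
```

Each carving step can only add cameras to the set for the middle voxel, which is the S ⊆ S* property used in the argument above.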

Space Carving scans voxels for color consistency similarly to Voxel Coloring, 
evaluating a plane of voxels at a time. It forces the scans to be front-to-back, relative 
to the cameras, by using only images whose cameras are currently behind the moving 
plane. Thus, when a voxel is evaluated, the transparency is already known of other 
voxels that might occlude it from the cameras currently being used. Unlike Voxel 
Coloring, Space Carving uses multiple scans, typically along the positive and negative 
directions of each of the three axes. Because carving is conservative, the set of 
uncarved voxels is a shrinking superset of the desired color-consistent model as the 
algorithm runs. 

While Space Carving never carves voxels it shouldn’t, it is likely to produce a 
model that includes some color-inconsistent voxels. During scanning, cameras that 
are ahead of the moving plane are not used for consistency checking, even when the 
voxels being checked are visible from those cameras. Hence, the color consistency of 
a voxel is, in general, never checked over the entire set of images from which it is 
visible. In contrast, every voxel in the final model constructed by GVC is guaranteed 
to be color consistent over the entire set of images from which it is visible. We find 
the models that GVC produces, using full visibility, project to better new views. 



3 Generalized Voxel Coloring 

We have developed two variants of our Generalized Voxel Coloring algorithm. GVC-LDI 
is an enhancement of GVC, the basic algorithm. The carving of one voxel 
potentially changes the visibility of other voxels. When an uncarved voxel’s visibility 
changes, its color consistency should be reevaluated and it, too, should be carved if it 
is then found to be inconsistent. GVC-LDI uses layered depth images (LDIs) [14, 15] 
to determine exactly which voxels have their visibility changed when another voxel is 
carved and thus can reevaluate exactly the right voxels. In the same situation, GVC 
does not know which voxels need to be reevaluated and so reevaluates all voxels in 
the current model. Therefore, GVC-LDI performs significantly fewer color-consistency 
evaluations than GVC during a reconstruction. However, GVC uses 
considerably less memory than GVC-LDI. 

Fig. 1. The data structures used to compute visibility. An item buffer (a) is used by GVC and 
records the ID of the surface voxel visible from each pixel in an image. A layered depth image 
(LDI) (b) is used by GVC-LDI and records all surface voxels that project onto each pixel. 

Like Space Carving, both GVC and GVC-LDI initially assume all voxels are 
opaque, i.e. uncarved. They carve inconsistent voxels until all those that remain 
project into consistent colors in the images from which they are visible. 



3.1 The Basic GVC Algorithm 

GVC determines visibility as follows. First, every voxel is assigned a unique ID. 
Then, an item buffer [16] is constructed for each image. An item buffer, shown in 
figure 1a, contains a voxel ID for every pixel in the corresponding image. While the 
item buffer is being computed, a distance is also stored for every pixel. A voxel V is 
rendered to the item buffer as follows. Scan conversion is used to find all the pixels 
that V projects onto. If the distance from the camera to V is less than the distance 
stored for the pixel, then the pixel’s stored distance and voxel ID are overwritten 
with those of V. Thus, after a set of voxels has been rendered, each pixel will contain 
the ID of the closest voxel that projects onto it. This is exactly the visibility 
information we need. 
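
The rendering rule described above can be sketched as a small depth-tested buffer. This is a minimal illustration, not the authors' implementation: the function name `render_item_buffer` is hypothetical, and each voxel's footprint is assumed to have been scan-converted already into a list of pixel coordinates.

```python
import math


def render_item_buffer(width, height, voxels):
    """Build an item buffer and a distance buffer for one image.

    `voxels` maps a voxel ID to (distance, pixels), where `pixels` is the
    already scan-converted footprint of the voxel as (x, y) tuples.
    """
    ids = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for voxel_id, (distance, pixels) in voxels.items():
        for x, y in pixels:
            inside = 0 <= x < width and 0 <= y < height
            if inside and distance < depth[y][x]:
                depth[y][x] = distance      # closer voxel overwrites the pixel
                ids[y][x] = voxel_id
    return ids, depth


# Two voxels overlap at pixel (1, 0); the closer one (ID 2) wins there.
voxels = {
    1: (5.0, [(0, 0), (1, 0)]),
    2: (3.0, [(1, 0), (2, 0)]),
}
ids, depth = render_item_buffer(3, 1, voxels)
```

After rendering, `ids` holds the ID of the closest voxel at every pixel, which is the visibility information the text refers to.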

Once valid item buffers have been computed for the images, it is then possible to 
compute the set vis(V) of all pixels from which the voxel V is visible. Vis(V) is 
computed as follows. V is projected into each image. For every pixel P in the 
projection of V, if P’s item buffer value equals V’s ID, then P is added to vis(V). To 
check the color consistency of a voxel V, we apply a consistency function consist() to 
vis(V) or, in other words, we compute consist(vis(V)). 
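
As a hedged sketch of the lookup just described, the function below gathers vis(V) from precomputed item buffers. The name `compute_vis` and the data layout (one footprint list and one item buffer per image) are assumptions made for illustration.

```python
def compute_vis(voxel_id, footprints, item_buffers):
    """Collect (image index, x, y) for pixels whose item-buffer entry is voxel_id.

    footprints[i] lists the pixels that the voxel projects onto in image i;
    item_buffers[i][y][x] holds the ID of the closest voxel at that pixel.
    """
    vis = []
    for i, pixels in enumerate(footprints):
        for x, y in pixels:
            if item_buffers[i][y][x] == voxel_id:
                vis.append((i, x, y))
    return vis


# Two one-row images; voxel 1 is partly hidden by voxel 2 in the first image.
item_buffers = [
    [[1, 2, 2]],
    [[1, 1, None]],
]
footprints_of_voxel_1 = [
    [(0, 0), (1, 0)],
    [(0, 0), (1, 0)],
]
vis = compute_vis(1, footprints_of_voxel_1, item_buffers)
```

The colors at the returned pixel positions would then be passed to the consistency function.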

Since carving a voxel changes the visibility of the remaining uncarved voxels, and 
since we use item buffers to maintain visibility information, the item buffers need to 
be updated periodically. GVC does this by recomputing the item buffers from scratch. 
Since this is time consuming, we allow GVC to carve many voxels between updates. 
As a result, the item buffers are out-of-date much of the time and the computed set 
vis(V) is only guaranteed to be a subset of all the pixels from which a voxel V is 
visible. However, since carving is conservative, no voxels will be carved that 
shouldn’t be. During the final iteration of GVC, no carving occurs so the visibility 
information stays up-to-date. Every voxel is checked for color consistency on the final 
iteration so it follows that the final model is color-consistent. 

initialize SVL 
for every voxel V 
    carved(V) = false 
loop { 
    visibilityChanged = false 
    compute item buffers by rendering voxels on SVL 
    for every voxel V ∈ SVL { 
        compute vis(V) 
        if (consist(vis(V)) = false) { 
            visibilityChanged = true 
            carved(V) = true 
            remove V from SVL 
            for all voxels N that are adjacent to V 
                if (carved(N) = false and N ∉ SVL) 
                    add N to SVL 
        } 
    } 
    if (visibilityChanged = false) { 
        save voxel space 
        quit 
    } 
} 

Fig. 2. Pseudo-code for the GVC algorithm. See text for details. 

As carving progresses, each voxel is in one of three categories: 

• it has been found to be inconsistent and has been carved; 

• it is on the surface of the set of uncarved voxels and has been found to be 
consistent whenever it has been evaluated; or 

• it is surrounded by uncarved voxels, so it is visible from no images and its 
consistency is undefined. 

We use an array of bits, one per voxel, to record which voxels have been carved. 
This data structure is called carved in the pseudo-code and is initially set to false for 
every voxel. We maintain a data structure called the surface voxel list (SVL) to 
identify the second category of voxels. The SVL is initialized to the set of voxels that 
are not surrounded by other voxels. The item buffers are computed by rendering all 
the voxels on the SVL into them. We call voxels in the third category interior voxels. 
Though interior voxels are uncarved, they do not need to be rendered into the item 
buffers because they are not visible from any images. When a voxel is carved, 
adjacent interior voxels become surface voxels and are added to the SVL. To avoid 
adding a voxel to the SVL more than once, we need a rapid means of determining if 
the voxel is already on the SVL; we maintain a hash table for this purpose. 

When GVC has finished, the final set of uncarved voxels may be recorded by 
saving the function carved() or the SVL. Pseudo-code for GVC appears in figure 2. 
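
The control flow of the outer loop can be sketched in runnable form as follows. This is a simplified illustration under stated assumptions, not the authors' code: visibility and consistency are supplied as hypothetical callbacks (`compute_vis`, `consist`), and a Python set stands in for both the SVL and its hash table.

```python
def gvc(initial_surface, neighbors, compute_vis, consist):
    """Carve until a full pass over the surface voxel list changes nothing.

    initial_surface: iterable of voxel IDs not surrounded by other voxels.
    neighbors(v): IDs of voxels adjacent to v.
    compute_vis(v, surface): pixels currently believed to see v (a stand-in
        for rendering item buffers from `surface` and projecting v).
    consist(pixels): True if the pixel set is color-consistent.
    """
    svl = set(initial_surface)      # a set gives constant-time membership tests
    carved = set()
    while True:
        visibility_changed = False
        snapshot = frozenset(svl)   # surface as of the last item-buffer update
        for v in sorted(snapshot):
            if not consist(compute_vis(v, snapshot)):
                visibility_changed = True
                carved.add(v)
                svl.discard(v)
                for n in neighbors(v):
                    if n not in carved and n not in svl:
                        svl.add(n)  # interior voxel becomes a surface voxel
        if not visibility_changed:
            return carved, svl


# Toy usage: a chain of six voxels in which voxels 0 and 1 are inconsistent.
BAD = {0, 1}
carved, svl = gvc(
    initial_surface={0, 5},
    neighbors=lambda v: [n for n in (v - 1, v + 1) if 0 <= n <= 5],
    compute_vis=lambda v, surface: {v},         # placeholder visibility
    consist=lambda pixels: not (pixels & BAD),  # placeholder consistency test
)
```

In this sketch, voxels exposed during a pass are first evaluated on the next pass, after the stand-in for the item buffers has been refreshed. That is a design choice of the example rather than something the text prescribes.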









W.B. Culbertson, T. Malzbender, and G. Slabaugh 



3.2 The GVC-LDI Algorithm 

Basic GVC computes visibility in a relatively simple manner that also makes efficient 
use of memory. However, the visibility information is time-consuming to update. 
Hence, GVC updates it infrequently and it is out-of-date much of the time. This does 
not lead to incorrect results but it does result in inefficiency because a voxel that 
would be evaluated as inconsistent using all the visibility information might be 
evaluated as consistent using a subset of the information. Ultimately, all the 
information is collected but, in the meantime, voxels can remain uncarved longer than 
necessary and can therefore require more than an ideal number of consistency 
evaluations. Furthermore, GVC reevaluates the consistency of voxels on the SVL 
even when their visibility (and hence their consistency) has not changed since their 
last evaluation. By using layered depth images instead of item buffers, GVC-LDI can 
efficiently and immediately update the visibility information when a voxel is carved 
and also can precisely determine the voxels whose visibility has changed. 

Unlike the item buffers used by the basic GVC method, which record at each pixel 
P just the closest voxel that projects onto P, the LDIs store at each pixel a list of all 
the surface voxels that project onto P. See figure 1b. These lists, which in the pseudo-code 
are called LDI(P), are sorted according to the distance of the voxel to the 
image’s camera. The head of LDI(P) stores the voxel closest to P, which is the same 
voxel an item buffer would store. Since the information stored in an item buffer is 
also available in an LDI, vis(V) can be computed in the same way as before. The LDIs 
are initialized by rendering the SVL voxels into them. 
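
A minimal sketch of one LDI pixel is given below, assuming a sorted Python list as the storage. The class name `LayeredDepthPixel` and its methods are hypothetical and only meant to show how the head of the list reproduces the item-buffer entry.

```python
import bisect


class LayeredDepthPixel:
    """One LDI pixel: surface voxels sorted by distance to the camera."""

    def __init__(self):
        self.entries = []           # sorted list of (distance, voxel_id)

    def insert(self, distance, voxel_id):
        # insort keeps the list ordered from nearest to farthest
        bisect.insort(self.entries, (distance, voxel_id))

    def head(self):
        """Closest voxel, i.e. what an item buffer would store here."""
        return self.entries[0][1] if self.entries else None


pixel = LayeredDepthPixel()
pixel.insert(7.5, 12)
pixel.insert(2.0, 40)
pixel.insert(4.0, 3)
```

Here `pixel.head()` returns the nearest voxel, so vis(V) can be computed from the LDIs exactly as it is from item buffers.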

The uncarved voxels whose visibility changes when another voxel is carved come 
from two sources: 

• They are interior voxels adjacent to the carved voxel and become surface voxels 
when the carved voxel becomes transparent. See figure 3a. 

• They are already surface voxels (hence they are in the SVL and LDIs) and are 
often distant from the carved voxel. See figure 3b. 




□ = interior voxel | = voxel with changed visibility 

Fig. 3. When a voxel is carved, there are two categories of other voxels whose visibility 
changes: (a) interior voxels that are adjacent to the carved voxel and (b) voxels that are already 
on the SVL and are often distant from the carved voxel. 

Voxels in the first category are trivial to identify since they are next to the carved 
voxel. Voxels in the second category are impossible to identify efficiently in the basic 
GVC method; hence, that method must repeatedly evaluate the entire SVL for color 
consistency. In GVC-LDI, voxels in the second category can be found easily with the 
aid of the LDIs; they will be the second voxel on LDI(P) for some pixel P in the 
projection of the carved voxel. GVC-LDI keeps a list of the SVL voxels whose 
visibility has changed, called the changed visibility SVL (CVSVL in the pseudo-code). 
These are the only voxels whose consistency must be checked. Carving is finished 
when the CVSVL is empty. 

When a voxel is carved, the LDIs (and hence the visibility information) can be 
updated immediately and efficiently. The carved voxel can be easily deleted from 
LDI(P) for every pixel P in its projection. The same process automatically updates the 
visibility information for the second category of uncarved voxels whose visibility has 
changed; these voxels move to the head of LDI lists from which the carved voxel has 
been removed and they are also added to the CVSVL. Interior voxels adjacent to the 
carved voxel are pushed onto the LDI lists for pixels they project onto. As a 
byproduct of this process, we learn if the voxel is visible; if it is, we put it on the 
CVSVL. Pseudo-code for GVC-LDI appears in figure 4. 
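
The update for the second category of voxels can be sketched as follows. The function name `carve_from_ldi` and the dictionary-of-lists layout are assumptions for illustration; each list is taken to be sorted from nearest to farthest voxel.

```python
def carve_from_ldi(ldi, voxel_id, footprint, cvsvl):
    """Remove a carved voxel from LDI lists and queue newly exposed voxels.

    ldi: dict mapping a pixel key to a list of voxel IDs sorted near-to-far.
    footprint: pixel keys that the carved voxel projects onto.
    cvsvl: set of voxels whose visibility changed (updated in place).
    """
    for p in footprint:
        layers = ldi.get(p, [])
        if voxel_id not in layers:
            continue
        if layers[0] == voxel_id and len(layers) > 1:
            cvsvl.add(layers[1])    # the next voxel becomes visible at p
        layers.remove(voxel_id)


ldi = {"p0": [5, 8, 9], "p1": [8, 5], "p2": [5]}
cvsvl = set()
carve_from_ldi(ldi, 5, ["p0", "p1", "p2"], cvsvl)
```

Only voxel 8 is queued for reevaluation: it was directly behind the carved voxel at pixel p0, whereas at p1 the carved voxel was already occluded and at p2 nothing lay behind it.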



initialize SVL 
render SVL to LDIs 
for every voxel V 
    carved(V) = false 
copy SVL to CVSVL 
while (CVSVL is not empty) { 
    delete V from CVSVL 
    compute vis(V) 
    if (consist(vis(V)) = false) { 
        carved(V) = true 
        remove V from SVL 
        for every pixel P in projection of V into all images { 
            if (V is head of LDI(P)) 
                add next voxel on LDI(P) (if any) to CVSVL 
            delete V from LDI(P) 
        } 
        for every voxel N adjacent to V with N ∉ SVL { 
            N_is_visible = false 
            for every pixel P in projection of N to all images { 
                add N to LDI(P) 
                if (N is head of LDI(P)) 
                    N_is_visible = true 
            } 
            add N to SVL 
            if (N_is_visible) 
                add N to CVSVL 
        } 
    } 
} 
save voxel space 

Fig. 4. Pseudo-code for the GVC-LDI algorithm. See text for details. 

Fig. 5. Convergence of the algorithms while reconstructing the “toycar” scene. 



4 Results 

We present the results of running Space Carving, GVC, and GVC-LDI on two image 
sets that we call “toycar” and “bench”. In particular, we present runtime statistics and 
provide, for side-by-side comparison, images synthesized with Space Carving and our 
algorithms. The experiments were run on a 440 MHz HP J5000 computer. 

The toycar and bench image sets represent opposite extremes in terms of how 
difficult they are to reconstruct. The toycar scene is ideal for reconstruction. The 
seventeen 800x600-pixel images are computer-rendered and perfectly calibrated. The 
colors and textures make the various surfaces in the scene easy to distinguish from 
each other. The bench images are photographs of a natural, real-world scene. In 
contrast to the toycar scene, the bench scene is challenging to reconstruct for a 
number of reasons: the images are somewhat noisy, the calibration is not as good, and 
the scene has large areas with relatively little texture and color variation. 

We reconstructed the toycar scene in a 167x121x101-voxel volume. Four of the 
input images are shown in figure 7. New views synthesized from Space Carving and 
GVC-LDI reconstructions are shown in figure 8. There are some holes visible along 
one edge of the blue-striped cube in the Space Carving reconstruction. The coloring in 
the Space Carving image has a noisier appearance than the GVC-LDI image. 

We used fifteen 765x509-pixel images of the bench scene and reconstructed a 
75x71x33-voxel volume. We calibrated the images with a product called 
PhotoModeler Pro [17]. The points used to calibrate the images are well dispersed 
throughout the scene and their estimated 3D coordinates project within a maximum of 
1.2 pixels of their measured locations in the images. Four of the input images are 
shown in figure 9. New views synthesized from Space Carving and GVC 
reconstructions are shown in figure 10. The Space Carving image is considerably 
noisier and more distorted than the GVC image. 

Fig. 6. Convergence of the algorithms while reconstructing the “bench” scene. 

Figures 5 and 6 show the total time Space Carving, GVC, and GVC-LDI ran on the 
toycar and bench scenes until carving completely stopped. They also illustrate the 
rates at which the algorithms converged to good visual representations of the scene. 
We used reprojection error to estimate visual quality. Specifically, we projected 
models to the same viewpoint as an extra image of the actual scene, and computed the 
errors by comparing corresponding pixels in the projected and actual images. 
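
One way such a pixel-wise comparison could be computed is sketched below. The exact error measure is not stated in the text, so the mean absolute per-channel difference used here is an assumption, as is the function name `reprojection_error`.

```python
def reprojection_error(projected, actual):
    """Mean absolute per-channel difference between two equally sized RGB images.

    Each image is a list of rows, each row a list of (r, g, b) tuples.
    This particular metric is an assumption made for illustration.
    """
    total = 0
    count = 0
    for row_p, row_a in zip(projected, actual):
        for pix_p, pix_a in zip(row_p, row_a):
            for c_p, c_a in zip(pix_p, pix_a):
                total += abs(c_p - c_a)
                count += 1
    return total / count if count else 0.0


projected = [[(10, 20, 30), (0, 0, 0)]]
actual = [[(12, 20, 27), (0, 6, 0)]]
error = reprojection_error(projected, actual)
```

A lower value indicates that the model, when projected to the held-out viewpoint, matches the extra photograph more closely.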

Due to the widely varied color and texture in the toycar scene, the color 
consistency of most voxels can be correctly determined using a small fraction of the 
input image pixels that will ultimately be able to view the voxel in the final model. 
Thus, many voxels that must be carved can, in fact, be carved the first time their 
consistency is checked. Space Carving, with its lean data structures, checks the 
consistency of voxels faster (albeit, less completely) than the other algorithms and 
hence, as shown in figure 5, was the first to converge to a good representation of the 
scene. After producing a good model, Space Carving spent a long time carving a few 
additional voxels and was the last to completely stop carving. However, this 
additional carving is not productive visually. Unlike the other two algorithms, GVC-LDI 
spends extra time to find all the pixels that can view a voxel when checking a 
voxel’s consistency. In the toycar scene, this precision was not helpful and caused 
GVC-LDI to converge the slowest to a good visual model. Ultimately, GVC and 
GVC-LDI reconstructed models with somewhat lower reprojection error than Space 
Carving. 

The convergence characteristics of the algorithms were different for the bench 
scene, as shown in figure 6. The color and texture of this scene make reconstruction 
difficult but are probably typical of the real-world scenes that we are most interested 
in reconstructing. In contrast to the toycar scene, many voxels in the bench scene are 
very close to the color-consistency threshold. Hence, many voxels that must be carved 
cannot be shown to be inconsistent until they become visible from a large number of 
pixels. Initially, all three algorithms converge at roughly the same rate. Then Space 
Carving stops improving in reprojection error but continues slow carving. GVC and 
GVC-LDI produce models of similar quality, both with considerably less error than 
Space Carving. After an initial period of relatively rapid carving, GVC then slows to a 
carving rate of several hundred voxels per iteration. Because, on each of these 
iterations, GVC recalculates all its item buffers and checks the consistency of 
thousands of voxels, it takes GVC a long time to converge to a color-consistent 
model. For the bench scene, the efficiency of GVC-LDI’s relatively complex data 
structures more than compensates for the time needed to maintain them. Because 
GVC-LDI finds all the pixels from which a voxel is visible, it can carve many voxels 
sooner, when the model is less refined, than the other algorithms. Furthermore, after 
carving a voxel, GVC-LDI only reevaluates the few other voxels whose visibility has 
changed. Consequently, GVC-LDI is faster than GVC by a large margin. On both the 
toycar and the bench scene, GVC-LDI used fewer color consistency checks than 
the other algorithms, as shown in table 1. 



Table 1. The number of color consistency evaluations performed by the algorithms while 
reconstructing the “toycar” and “bench” scenes. 

                  toycar      bench 
Space Carving     12.70 M     4.24 M 
GVC                3.15 M     2.54 M 
GVC-LDI            2.14 M     0.526 M 



All three algorithms keep copies of the input images in memory. The images 
dominate the memory usage for Space Carving. GVC uses an equal amount of 
memory for the images and the item buffers and, consequently, uses about twice as 
much memory as Space Carving, as shown in table 2. The LDIs dominate the memory 
usage in GVC-LDI and consume an amount of memory roughly proportional to the 
number of image pixels times the depth complexity of the scene. The table shows that 
GVC-LDI uses considerably more memory than the other two algorithms. Memory 
consumed by the carved and SVL data structures is relatively insignificant and, 
therefore, the voxel resolution has little bearing on the memory requirements for GVC 
and GVC-LDI. 

Table 2. The memory used by the algorithms while reconstructing the “toycar” and “bench” 
scenes. 

                  toycar       bench 
Space Carving      43.2 MB     26.1 MB 
GVC                85.7 MB     53.9 MB 
GVC-LDI           462.0 MB    385.0 MB 



Kutulakos and Seitz have shown that for a given image set and monotonic 
consistency function, there is a unique maximal color-consistent set of voxels, which 
they call V*. Since GVC and GVC-LDI do not stop carving until the remaining 
uncarved voxels are all color-consistent and since they never carve consistent voxels, 
we expect them to produce identical results, namely V*, when used with a monotonic 
consistency function. However, monotonic consistency functions can be hard to 
construct. An obvious choice for consist(S) would take the maximum difference 
between the colors of any two pixels in S. However, using distance in the RGB cube 
as a difference measure, this function is O(n²) in the size of S and has poor immunity 
to noise and high-frequency color variation. We actually use standard deviation for 
the consistency function and it is not monotonic. Consequently, GVC and GVC-LDI 
generally produce models that are different but similar in quality. 
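
The difference between the two kinds of consistency function can be seen in a small numerical example. The sketch below is a simplification under stated assumptions: it uses scalar gray values instead of RGB colors, and the function names and the threshold value are hypothetical.

```python
import statistics


def consist_max_diff(values, threshold):
    """Monotonic: adding pixels can never shrink the maximum difference."""
    return max(values) - min(values) <= threshold


def consist_std(values, threshold):
    """Not monotonic: extra pixels near the mean can lower the deviation."""
    return statistics.pstdev(values) <= threshold


S = [0, 10]                          # two dissimilar pixels
S_star = S + [5, 5, 5, 5, 5, 5]      # a superset with many mid-gray pixels
threshold = 4

std_small = consist_std(S, threshold)            # deviation 5.0, inconsistent
std_large = consist_std(S_star, threshold)       # deviation 2.5, consistent
max_small = consist_max_diff(S, threshold)       # difference 10, inconsistent
max_large = consist_max_diff(S_star, threshold)  # difference 10, inconsistent
```

With standard deviation, the inconsistent set S has a consistent superset S*, which violates monotonicity. The maximum-difference test stays inconsistent for every superset of S.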




Fig. 7. Four of the seventeen images of the toycar scene. 




Fig. 8. New views projected from reconstructions of the toycar scene. The image on the left 
was created with Space Carving, the image on the right with GVC-LDI. There are some holes 
visible along one edge of the blue-striped cube in the Space Carving reconstruction. The 
coloring in the Space Carving image has a noisier appearance than the GVC-LDI image. 



5 Future Work 

We have devised a reformulation of the GVC algorithm that we hope to implement in 
the near future. Our current implementation renders each voxel into each image twice 
per iteration in the outer loop, once to update the item buffers and a second time to 
gather color statistics for consistency checking. We treat voxels as cubes and render 
them by scan-converting their faces, a process that is time-consuming. In our new 
implementation, we will eliminate the second rendering. Instead, after updating an 
item buffer, we will scan its pixels, using the voxel IDs to accumulate the pixel colors 
into the statistics for the correct voxels. Besides reducing the amount of rendering, 
this approach has three other benefits. First, we will not need the item buffer again 
after processing an image. Thus, the same memory can be used for all the item 
buffers, reducing the memory requirement relative to our current implementation. 
Second, the rendering that is still required can be easily accelerated with a hardware 
graphics processor. Third, the processing that must be performed on one image is 
independent of the processing for the rest of the images. Thus, on one iteration, each 
image can be rendered on a separate, parallel processor. 
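
The proposed single-pass accumulation can be sketched as follows. This is only an illustration of the idea, assuming gray values rather than RGB colors and running sums as the per-voxel statistics; the function name `accumulate_statistics` is hypothetical.

```python
from collections import defaultdict


def accumulate_statistics(item_buffer, image, stats=None):
    """Scan one item buffer and add each pixel's color to its voxel's statistics.

    stats maps voxel ID -> [count, sum, sum of squares] for a gray value.
    """
    if stats is None:
        stats = defaultdict(lambda: [0, 0.0, 0.0])
    for id_row, color_row in zip(item_buffer, image):
        for voxel_id, color in zip(id_row, color_row):
            if voxel_id is None:
                continue            # no voxel projects onto this pixel
            entry = stats[voxel_id]
            entry[0] += 1
            entry[1] += color
            entry[2] += color * color
    return stats


# Process two images one after the other; each item buffer is needed only once.
stats = accumulate_statistics([[1, 2, 2]], [[10, 20, 30]])
stats = accumulate_statistics([[1, 1, None]], [[14, 12, 99]], stats)
```

Because each image contributes independent partial sums, the per-image scans could run on separate processors and be merged afterwards, and the count, sum and sum of squares are sufficient to evaluate a standard-deviation test per voxel.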

We believe LDIs have great potential for use in voxel-based reconstruction 
algorithms. A hybrid algorithm could make the memory requirements for LDIs more 
practical. In such an approach, an algorithm like GVC would be used to find a rough 
model. GVC runs most efficiently during its earliest iterations, when many voxels can 
be carved on each iteration. Once a rough model has been obtained, GVC-LDI could 
be used to refine local regions of the model. The rough model should be sufficient to 
find the subset of images from which the region is visible. The subset is likely to be 
considerably smaller than the original set, so the memory requirements for the LDIs 
should be much smaller. GVC-LDI would run until the model of the region converges 
to color-consistency. If the convergence is slow, GVC-LDI should be much faster 
than GVC. 

The algorithms described in this paper find a set of voxels whose color 
inconsistency falls below a threshold. It would be preferable to have an algorithm that 
would attempt to minimize color inconsistency, that is, to find the model that is most 
consistent with the input images. We have already mentioned that LDIs can 
efficiently maintain visibility information when voxels are removed from the model, 
i.e. carved, but, in fact, they can also be used to efficiently maintain visibility 
information when voxels are moved or added to the model. Thus, LDIs could be a key 
element in an algorithm that would add, delete, and move voxels in a model to 
minimize its reprojection error. 



6 Conclusion 

We have described a new algorithm, Generalized Voxel Coloring, for constructing a 
model of a scene from images. We use GVC for image-based modeling and 
rendering, specifically for synthesizing new views of the scene. Unlike most earlier 
solutions to the new-view synthesis problem, GVC accommodates arbitrary numbers 
of images taken from arbitrary viewpoints. Like Voxel Coloring and Space Carving, 
GVC colors voxels using color consistency, but it generalizes these algorithms by 
allowing arbitrary viewpoints and using all possible images for consistency checking. 
Furthermore, GVC-LDI uses layered depth images to significantly reduce the number 
of color consistency checks needed to build a model. We have presented experimental 
data and new images, synthesized using our algorithms and Space Carving, that 
demonstrate the benefit of using full visibility when checking color consistency. 

Fig. 9. Four of the fifteen input images of the bench scene. 

Fig. 10. New views projected from reconstructions of the bench scene. The image on the left 
was created with Space Carving. The image on the right was created with GVC. The Space 
Carving image is considerably noisier and more distorted than the GVC image. 



Discussion 

Yvan Leclerc: Do GVC and GVC-LDI give you exactly the same answers? 

Bruce Culbertson: That's a very good question. They would if we used a strictly 
monotonic consistency function, but for various reasons we haven't. We found that it 
was hard to design a strictly monotonic consistency function that was relatively 
immune to image noise. We need to do some averaging, and that upsets the 
monotonicity. So we don't get voxel-for-voxel identical models with the two 
algorithms. But we found that they usually look identical and have nearly 
identical reprojection errors. 

Rick Szeliski: What can you say about sampling issues? - A voxel projects into 
several pixels with partial fill. What do you do about that? 

Bruce Culbertson: In general we've used voxels that project into many pixels. We 
really haven't explored what happens when we use very small voxels, so I can't say 
too much about that. I will say that by using voxels that are large relative to the image 
resolution, we get very good noise immunity. It also extends the runtime, but it's 
worth exploring. 

Bill Triggs: Two questions. One: in all of these voxel-based approaches there's an 
$n^3$ scaling rule - the voxel resolution cubed - whereas a 
surface-based approach would be only $n'^2$. That seems to suggest that if you have 
a very high resolution or very large scenes, maybe a voxel-based approach would be 
inefficient compared to a surface-based one. 

Bruce Culbertson: Well, there's one detail that I had to leave out to make my talk 
short enough. Although our model consists of all the opaque voxels, the vast majority 
of these are on the interior of the model and can't be seen from any of the images. 
That makes them really uninteresting, so we've minimized the amount of memory and 
time devoted to dealing with them. So our in-memory representation is actually just 
the surface. 

Bill Triggs: The second question follows on from that. If you somehow get a pixel 
wrong and carve it accidentally when you shouldn't have, it makes a hole in the 
model. Do the holes tend to remain relatively shallow, or can they become deep or 
even punch right through the model, so that the cameras on the other side start carving 
away material too and you eventually end up with empty space? 

Bruce Culbertson: Our color-consistency functions usually have some threshold for 
deciding what we mean by consistency, and if we set that too low we do get holes in 
the model. That makes the algorithm think that the cameras can see right through 
those holes onto the other side of the object's surface, so errors can sometimes 
propagate very badly. When that happens it's often difficult to get intuition about what 
went wrong and figure out which camera saw through what hole to make the model so 
poor. 

Bill Triggs: So do you have any "repair heuristics" for this? 

Bruce Culbertson: It would be great to have something like that, but we haven't tried 
it. 



Abstract. Projective reconstruction recovers projective coordinates of 
3D scene points from their several projections in 2D images. We introduce 
a method for projective reconstruction based on concatenation 
of trifocal constraints around a reference view. This configuration simplifies 
computations significantly. The method uses only linear estimates 
which stay “close” to image data. The method requires correspondences 
only across triplets of views. However, it is not symmetrical with respect 
to views. The reference view plays a special role. The method can be 
viewed as a generalization of Hartley’s algorithm, or as a particular 
application of Triggs’ closure relations. 



1 Introduction 

Finding a projective reconstruction of the scene from its images is a problem 
which has been addressed in many works. It is a difficult problem mainly 
for two reasons. Firstly, if more images of a scene are available, it is 
difficult to see all scene points in all images, as some of the points often become 
occluded by the scene itself. Secondly, image data are affected by noise so that 
there is usually no 3D reconstruction that is consistent with raw image data. In 
order to find an approximate solution which would be optimal with respect to 
errors in image data, a nonlinear bundle adjustment has to be performed or 
approximate methods have to be used. 

In the past, research addressed both problems. Methods for finding a projective 
reconstruction of the scene from many images, assuming that all correspondences 
are available, were proposed. On the other hand, if only two, 
three, or four images were used, methods for obtaining a projective reconstruction 
by a linear Least Squares method were presented. 

We concentrate on the situation when there are more than four views and not 
all correspondences are available. We show that a linear Least Squares method 
can be used if there is a common reference view so that the correspondences 
between the reference view and other views exist. 

* This research is supported by the Grant Agency of the Czech Republic under the 
grants 102/97/0480, 102/97/0855, and 201/97/0437, and by the Czech Ministry of 
Education under the grant VS 96049. 

The proposed approach extends Hartley’s method for computing camera 
projection matrices to more than four views. His method computes the 
matrices in two steps. Firstly, the epipoles are computed by a Least Squares 
method. Secondly, using the epipoles and image data, the rest of the camera projection 
matrices, the infinite homographies, are estimated, again by a Least Squares 
method. 

We assume that the views are arranged so that all have some correspondences 
with the same reference view. Therefore we can compute all the epipoles between 
the reference view and the other views. Let the correspondences be, for instance, 
available among triplets of views. Then, trifocal tensors can be estimated independently 
in each triplet and the epipoles can be computed from the tensors. 
Having the epipoles, one large linear Least Squares problem for computing the 
homographies can be constructed. 

The paper is organized as follows. In section 1.1 the definition of the projective 
reconstruction is given. Section 1.2 reviews multifocal constraints. A brief 
overview of existing methods for projective reconstruction follows. 
The method for projective reconstruction from many views sharing a reference 
view is then introduced, experiments showing the feasibility of the proposed 
method are given, and the work is summarized. 

1.1 Projective Reconstruction 

Let a camera be modeled by a projection from the projective space
$\mathbb{P}^3$ to $\mathbb{P}^2$. The homogeneous coordinates of points in the
$i$-th image are denoted by $u_j^{(i)} \in \mathbb{P}^2$ and the homogeneous
coordinates of points in $\mathbb{P}^3$ are denoted by $x_j$.

Then, the projections of a set of $m$ points by $n$ cameras can be expressed as

$$ s_j^{(i)} u_j^{(i)} = P^{(i)} x_j, \quad i = 1, \dots, n, \; j = 1, \dots, m, \qquad (1) $$

where the $3 \times 4$ real matrix $P^{(i)}$ is a camera projection matrix and
$s_j^{(i)} \in \mathbb{R} \setminus \{0\}$ are scale factors.

The goal of a projective reconstruction is to find camera matrices $P^{(i)}$ and
homogeneous coordinates $x_j$ so that equation (1) is satisfied for all image
points $u_j^{(i)}$, $i = 1, \dots, n$, $j = 1, \dots, m$.

Since both $P^{(i)}$ and $x_j$ are unknown, they can be recovered only up to a
choice of a coordinate system in $\mathbb{P}^3$, i.e. up to a homography. Once
the camera matrices $P^{(i)}$ are known, the subsequent recovery of the points
$x_j$ is trivial (and vice versa). Therefore, the following definition of
projective reconstruction is introduced:

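Definition 1 (Projective reconstruction). The recovery of the equivalence
class

$$ \mathcal{P} = \left\{ \left(\tilde{P}^{(1)}, \dots, \tilde{P}^{(n)}\right) \;\middle|\; \left(\tilde{P}^{(1)}, \dots, \tilde{P}^{(n)}\right) = \left(P^{(1)} H, \dots, P^{(n)} H\right), \; H \in \mathbb{R}^{4 \times 4}, \; \det(H) \neq 0 \right\} $$

from a set of points $u_j^{(i)}$, $i = 1, \dots, n$, $j = 1, \dots, m$, such that
there exists a corresponding set of points $x_j \in \mathbb{P}^3$ and
$s_j^{(i)} \in \mathbb{R} \setminus \{0\}$ so that (1) holds,
is called the projective reconstruction.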
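The ambiguity behind Definition 1 can be checked numerically. The following is a minimal sketch, not part of the original paper: the sizes, the random cameras and the variable names are illustrative assumptions. It shows that replacing every camera by $P^{(i)} H$ and every point by $H^{-1} x_j$ leaves the homogeneous image points unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 10                                            # hypothetical sizes
cameras = [rng.normal(size=(3, 4)) for _ in range(n)]   # stand-ins for P^(i)
X = rng.normal(size=(4, m))                             # columns are the points x_j
H = rng.normal(size=(4, 4))                             # random homography of P^3

# The definition requires a non-singular H.
assert abs(np.linalg.det(H)) > 1e-6

cameras_h = [P @ H for P in cameras]                    # P^(i) H
X_h = np.linalg.solve(H, X)                             # H^{-1} x_j

# Both reconstructions produce the same homogeneous image points.
same_images = all(np.allclose(P @ X, Ph @ X_h)
                  for P, Ph in zip(cameras, cameras_h))
print(same_images)
```

Every member of the equivalence class therefore explains the image data equally well, which is why only the class as a whole can be recovered.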

1.2 Multifocal Constraints 

The algorithms for a projective reconstruction are usually based on so-called
multifocal constraints. The multifocal constraints are derived from (1) by
eliminating $s_j^{(i)}$ and $x_j$. Introducing a matrix $L_j^{(i)}$ that
annihilates the image point, $L_j^{(i)} u_j^{(i)} = 0$,
the equations (1) can be transformed into the equivalent system

$$ \begin{bmatrix} L_j^{(1)} P^{(1)} \\ \vdots \\ L_j^{(n)} P^{(n)} \end{bmatrix} x_j = M_j x_j = 0, \quad j = 1, \dots, m. \qquad (2) $$

Then, the multifocal constraints between $u_j^{(i)}$ and $P^{(i)}$ assuring the
existence of $x_j \in \mathbb{P}^3$, $x_j \neq 0$, can be written as

$$ \det M_j^{\iota,\kappa,\lambda,\mu} = 0, \quad \forall j = 1, \dots, m, \qquad (3) $$

where $M_j^{\iota,\kappa,\lambda,\mu}$ is the sub-matrix of $M_j$ consisting of
rows $\iota, \kappa, \lambda, \mu$. It is seen from the size of $M_j$ that
$m \binom{3n}{4}$ such constraints can be constructed. Since
$\mathrm{rank}\, L_j^{(i)} = 2$, it follows that not all of them are linearly
independent. Depending on the chosen rows, $M_j^{\iota,\kappa,\lambda,\mu}$ can
comprise coordinates of points from two, three, or four images and therefore we
speak about bifocal, trifocal, or quadrifocal constraints respectively. The
terms formed by $P^{(i)}$ are just the components of the well-known multi-view
(matching) tensors: epipoles, fundamental matrices, trifocal and quadrifocal
tensors.

Hence the solution of a projective reconstruction from $m$ points projected
to $n$ views can be described by the system of polynomial equations (3) of
degree four. Solving such a system appears to be a difficult problem¹. In
addition, the measured data involve errors in real situations and thus this
over-constrained system (3) need not have any non-trivial solution. Therefore,
only an approximate solution can be obtained.
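The construction of $M_j$ can be illustrated with a small numerical sketch. It assumes, for illustration only, that $L_j^{(i)}$ is the skew-symmetric cross-product matrix of the image point (one possible rank-2 matrix that annihilates it); the camera matrices and the point are random stand-ins. For noise-free data $M_j$ has rank 3, so every $4 \times 4$ minor, i.e. every constraint of type (3), vanishes.

```python
import itertools
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x, a rank-2 matrix with skew(u) @ u = 0."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

rng = np.random.default_rng(1)
n = 4                                                   # hypothetical number of views
cameras = [rng.normal(size=(3, 4)) for _ in range(n)]   # stand-ins for P^(i)
x = rng.normal(size=4)                                  # one scene point x_j

# Stack L^(i) P^(i) for all views; with L = skew(P x) this is system (2).
M = np.vstack([skew(P @ x) @ P for P in cameras])       # shape (3n, 4)

rank = np.linalg.matrix_rank(M)

# Every choice of four rows gives one determinant constraint of type (3).
minors = [np.linalg.det(M[list(rows)])
          for rows in itertools.combinations(range(3 * n), 4)]
max_minor = max(abs(d) for d in minors)
print(rank, len(minors), max_minor)
```

With noisy image points the minors no longer vanish exactly, which is the over-constrained situation described above.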

¹ In this context, we should mention the paper by Bondyfalat, Mourrain, and
Pan, who present a method for solving over-constrained polynomial systems.

1.3 Brief Overview of Existing Methods

The “ideal” optimization technique for projective reconstruction is based on
bundle adjustment, i.e. on minimizing the distances between the original and
the reprojected image points. Due to the non-linearity and the complexity of
the problem, this can be solved only by a numerical search (e.g. by gradient
descent), which requires an initial estimate of the multi-view geometry.
Let us review the main principles that have appeared in the literature for
obtaining an initial estimate of multi-view geometry and compare them with our
approach.

Method based on a six point projective invariant. The principle of the
method was first introduced by Quan. He showed that the solution of
a projective reconstruction from three views can be expressed in a closed form
using six point correspondences. The main disadvantage of the method in
practical situations is that even a small error in one of the six selected
correspondences can completely corrupt the result. Therefore, if the
reconstruction is to be correct even in the presence of errors in the
correspondences, some additional optimization has to be employed. An algorithm
based on random sampling of the input set of correspondences, as applied by
Torr, is an example.



Methods based on linearization of matching constraints. The linearization
means that a non-linear task is decomposed into several subtasks, each of which
can be solved by a least-squares estimate of a linear system. The approaches
based on linearization are not optimal with respect to noise in image data. They
minimize algebraic distances, which have no direct geometric meaning, instead of
the image discrepancies. Therefore the estimates should be formulated so that
noise in input data does not skew the solution too much. The results provided by
a linearization can be used as an initial estimate for the numerical search in a
gradient optimization technique.

The linearization of matching constraints makes use of two facts: (i) the
matching constraints (3) are linear in the multi-view tensors, (ii) the
multi-view tensors or their special combinations can be linearly decomposed into
projection matrices. Thus, the nonlinear problem of projective reconstruction
can be approximated in two linear² steps:



1. Estimate multi-focal tensor(s) from image data.

2. Decompose the multiple view tensor(s) to projection matrices $P^{(i)}$, $i = 1, \dots, n$.



At first, methods based on only one matching tensor (bifocal, trifocal, or
quadrifocal) were developed, i.e. methods for projective reconstruction
from 2, 3, or 4 views only.

Considering more than four views, the task becomes more complicated. Triggs
described a method for concatenating several matching tensors so that they
cover n views and so that the relations³ between the tensors and the
projection matrices are linear. However, the tensors have to be scaled
consistently first. A general way of concatenating the trifocal tensors was also
presented by Avidan and Shashua. A different approach is based on threading of
fundamental matrices using trifocal tensors. The estimation is performed only by
triplets of views in a sequence and not from all n views at once. Therefore,
errors of consecutive estimations may accumulate during the process. A
reconstruction from many views under additional constraints was presented e.g.
by Fitzgibbon et al.

² “Linear problem” is meant in the sense that the problem is equivalent to a
Least Squares solution of a system of linear equations.

The decomposition of the non-linear problem into two consecutive steps brings
the following difficulty: since the image data are affected by noise, the first
step yields only approximations of the matching tensors. These approximations
need not fulfill all the tensor constraints, and the subsequent decomposition
into projection matrices becomes unstable.

Hartley has introduced the following improvement: estimate only the
epipoles from the given tensor and then estimate the projection matrices
directly from image data. This refinement stabilizes the process significantly.
So far, this important improvement was known only for the projective
reconstruction based on one matching tensor, i.e. only for projective
reconstruction from 2, 3, or 4 views. The improvement is impossible in general
for the chains of tensors derived by Triggs.

Sturm and Triggs improved the method for n views in a different way.
They proposed to recover only the ‘projective’ depths of image points from
closure relations. Then, a so-called joint image matrix can be constructed from
the scaled image points. This matrix can be directly factorized into projection
matrices (using SVD). The disadvantage of this technique is that the joint image
matrix can contain only the points which are observed in all n views. The points
seen only in some of the n views cannot be involved in the computations.

In Section 2 we present an algorithm for projective reconstruction from
n > 4 views. The algorithm is based on a linearization of the trifocal
constraints concatenated around a reference view. This configuration is a
special application of Triggs' e-G-e closure relation, and the common reference
view simplifies the computations significantly.

It allows the projection matrices to be estimated (using the epipoles) directly
from image data, analogously to Hartley's improvement. Furthermore, the problem
of consistent tensor scaling disappears in this case. The image points used
need not be observed in all n views. It is sufficient if they are observed in a
triplet of views, one of which is the reference view.

³ They are called joint image closure relations in Triggs' paper.

Fig. 1. Cake configuration of n - 1 triplets of views.



2 Projective Reconstruction from Trifocal Constraints 
with a Common Reference View 

Let us consider n views covered by trifocal constraints so that they all share
a common reference view. (An example of such a configuration is the “Cake”
configuration, see Figure 1.) Then, the following two facts hold.

Theorem 1. The trifocal tensor $T^{(a)}$ is related to image data by the linear
constraint (4),
where $a = 1, \dots, n-1$ indexes the triplets of views, $b = a+1$ and
$c = (a+2) \bmod n$ index the views, and $l_\lambda^{(i)}$ denotes the
$\lambda$-th row of $L^{(i)}$ from (2).

Proof. See the literature on the trifocal tensor. □



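Theorem 2. The tensors $T^{(a)}$ can be expressed as linear forms of the
projection matrices $P^{(b)}$, $P^{(c)}$ and the epipoles $e^{(b)}$, $e^{(c)}$.

Proof. The general relation between the trifocal tensor $T^{(a)}$, the epipoles
$e^{(b)}$, $e^{(c)}$, and $P^{(1)}$, $P^{(b)}$, $P^{(c)}$ is

$$ T^{(a)\cdot\cdot k} P^{(1)} = e^{(c)k} P^{(b)} - e^{(b)} P^{(c)k}, \quad k = 1, 2, 3, \qquad (5) $$

where $T^{(a)\cdot\cdot k}$ is a $3 \times 3$ matrix ($\cdot$ denotes the free
indexes), $e^{(c)k}$ denotes the $k$-th component of $e^{(c)}$, and $P^{(c)k}$
the $k$-th row of $P^{(c)}$ (Triggs calls it the e-G-e closure relation).
Considering $P^{(1)} = [I, 0]$, $P^{(b)} = [A^{(b)}, e^{(b)}]$ and
$P^{(c)} = [A^{(c)}, e^{(c)}]$, we obtain

$$ T^{(a)\cdot\cdot k} = e^{(c)k} A^{(b)} - e^{(b)} a^{(c)k}, \quad k = 1, 2, 3, \qquad (6) $$

where $a^{(c)k}$ denotes the $k$-th row of $A^{(c)}$. □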

When combining (4) with (6), the following consequence is evident.

Consequence 1. Let n views be covered by trifocal tensors so that they all have
a common reference view. Then, having the epipoles $e^{(b)}$, the camera
matrices $P^{(b)}$ can be estimated directly (and linearly) from image data.
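Equation (6) can be checked on synthetic data. The sketch below is an illustration, not the authors' code: it builds one tensor from random $A^{(b)}$, $e^{(b)}$, $A^{(c)}$, $e^{(c)}$ and verifies a standard form of the trifocal point constraint, $[u^{(b)}]_\times \big(\sum_i u^{(1)}_i T^{\cdot i \cdot}\big) [u^{(c)}]_\times = 0$. Whether this matches the exact form of constraint (4) in the paper is an assumption, and the function names are hypothetical.

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def tensor_from_cameras(A_b, e_b, A_c, e_c):
    """Equation (6): T[j, i, k] = e_c[k] * A_b[j, i] - e_b[j] * A_c[k, i]."""
    return (np.einsum('k,ji->jik', e_c, A_b)
            - np.einsum('j,ki->jik', e_b, A_c))

rng = np.random.default_rng(2)
A_b, A_c = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
e_b, e_c = rng.normal(size=3), rng.normal(size=3)
T = tensor_from_cameras(A_b, e_b, A_c, e_c)

# One scene point (X, w) seen by P1 = [I, 0], Pb = [A_b, e_b], Pc = [A_c, e_c].
X, w = rng.normal(size=3), rng.normal()
u1 = X
u2 = A_b @ X + w * e_b
u3 = A_c @ X + w * e_c

# Contract the tensor with the reference-view point, then apply the
# point-point-point incidence relation; all nine entries should vanish.
N = np.einsum('jik,i->jk', T, u1)
residual = skew(u2) @ N @ skew(u3)
print(np.abs(residual).max())
```

The residual is linear in the entries of $T$, and by (6) the tensor is linear in $A^{(b)}$ and $A^{(c)}$ once the epipoles are fixed, which is the observation behind Consequence 1.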

Thus, the complete algorithm for projective reconstruction can be outlined
as follows:

1. Estimation of the epipoles $e^{(b)}$. There are several ways to estimate the
epipoles, e.g. via bifocal, trifocal or quadrifocal constraints. We consider the
estimation of $e^{(c)}$ from the neighboring tensors $T^{(a)}$, $T^{(a+1)}$ the
most “natural” (i.e. as the common perpendicular to six null spaces of the
$3 \times 3$ slices of $T^{(a)}$ and $T^{(a+1)}$ indexed by $k = 1, 2, 3$),
where $a = 1, \dots, n-1$, $c = (a+2) \bmod n$.

2. Estimation of $A^{(b)}$ from image data using $e^{(b)}$, $b = 2, \dots, n$.

A detailed description of the estimation for the “Cake” configuration is given
in Appendix A.
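Step 2 can be sketched for a single triplet of views. The code below is a simplified illustration under stated assumptions, not the estimator of Appendix A: it uses one triplet only, noise-free synthetic points, and the point constraint $[u^{(b)}]_\times N [u^{(c)}]_\times = 0$ with $N$ built from (6). All function and variable names are hypothetical. With the epipoles fixed, the constraint is linear in the 18 entries of $A^{(b)}$ and $A^{(c)}$, and its null space has dimension four: the true solution plus the three free parameters $A^{(b)} + e^{(b)} v^\top$, $A^{(c)} + e^{(c)} v^\top$.

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]_x."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def residuals(params, e_b, e_c, U1, U2, U3):
    """Stacked trifocal residuals; linear in params = (A_b, A_c) flattened."""
    A_b, A_c = params[:9].reshape(3, 3), params[9:].reshape(3, 3)
    out = []
    for u1, u2, u3 in zip(U1, U2, U3):
        # N[j, k] = sum_i u1[i] * T[j, i, k] with T taken from equation (6).
        N = np.outer(A_b @ u1, e_c) - np.outer(e_b, A_c @ u1)
        out.append((skew(u2) @ N @ skew(u3)).ravel())
    return np.concatenate(out)

def design_matrix(e_b, e_c, U1, U2, U3):
    """Evaluate the linear map on the 18 unit vectors to obtain its matrix."""
    return np.stack([residuals(col, e_b, e_c, U1, U2, U3)
                     for col in np.eye(18)], axis=1)

rng = np.random.default_rng(3)
A_b, A_c = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
e_b, e_c = rng.normal(size=3), rng.normal(size=3)

m = 12                                    # synthetic correspondences
X, w = rng.normal(size=(m, 3)), rng.normal(size=m)
U1 = X                                    # reference view, P1 = [I, 0]
U2 = X @ A_b.T + np.outer(w, e_b)
U3 = X @ A_c.T + np.outer(w, e_c)

D = design_matrix(e_b, e_c, U1, U2, U3)
truth = np.concatenate([A_b.ravel(), A_c.ravel()])
v = rng.normal(size=3)
gauge = np.concatenate([np.outer(e_b, v).ravel(), np.outer(e_c, v).ravel()])

svals = np.linalg.svd(D, compute_uv=False)
null_dim = int(np.sum(svals < 1e-9 * svals[0]))
print(null_dim)
```

In the “Cake” configuration the same kind of system is assembled over all triplets at once, which is what makes a simultaneous estimate of all homographies possible.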



3 Experiments 

In all experiments, the “Cake” configuration of the trifocal constraints is
used, i.e. the trifocal constraints arise from the view triplets (1,2,3),
(1,3,4), ..., (1,n,2), see Figure 1.

3.1 Experiment on Synthetic Data 

In the first experiment, we tested the accuracy and the stability of the
algorithm with respect to noise for 3 and 5 views. An artificial scene consisted
of a set of 40 points distributed randomly (uniform distribution) in a cube of
size 1 meter. A camera with viewing angle 63° was used. The image size was
1000 x 1000 units⁴. Image data were corrupted by Gaussian noise with standard
deviation increasing gradually from 0.5 to 25 image units. The reprojection
error of the reconstruction provided by the proposed algorithm was evaluated
for 200 measurements for a given value of noise. The camera positions were
selected randomly at a distance between 2 and 2.5 meters from the scene center.

⁴ The image units get a physical meaning with respect to the focal length f. In
this experiment one image unit is proportional to sin(α/2) f, where α is the
viewing angle.

[Plots: reprojection error in image No. 1; reprojection error in image No. 2.]

Fig. 2. Variance of the reprojection error vs. variance of noise in image data.

The results of the experiment are illustrated in Figures 2 and 3. Since the
tested algorithm is symmetric with respect to images 2, 3, 4, and 5, we present
the errors only for images 1 and 2. Figures 2 and 3 show that the reprojection
error increases linearly in the tested range of noise. Furthermore, the accuracy
of the results increases with the number of images.

3.2 Experiments on Real Data 

The behavior of the proposed algorithm was tested on two sets of real images.
The first set captures the house at Kampa, the second set captures a cardboard
model of a toy house. The experimental software CORRGUI was used to select the
correspondences manually, to define polygonal faces of the reconstruction, and
to map texture from images onto the reconstruction.



House at Kampa. We took 7 images of a house using an uncalibrated
photographic camera, see Figure 4. Then, the photographs were digitized at a
resolution of 2393 x 3521 pixels.

Point correspondences in image triplets were assigned manually. The following
numbers of correspondences were selected:

Image triplet               (1,2,3)  (1,3,4)  (1,4,5)  (1,5,6)  (1,6,7)  (1,7,2)
Number of correspondences      67       62       46       51       82       82

[Histogram panels: (a) 3 views, noise variance = 5 i.u.; (b) 5 views, noise
variance = 5 i.u.; (c) 3 views, noise variance = 15 i.u.; (d) 5 views, noise
variance = 15 i.u.]

Fig. 3. Histograms of 200 measurements of the reprojection error in image No. 2 for
noise level 5 and 15 image units.



The projective reconstruction from all 7 images was performed. The following
table shows the maximal, mean, and median distances between the input and the
reprojected image points (the maximal error for image 6 is illegible in the
source):

Image No.              1      2      3      4      5      6      7
Maximal error [pxl]   13.4   27.2   12.2    8.6   14.1    ?      8.9
Mean error [pxl]       1.3    2.3    2.6    1.9    6.7    1.4    3.1
Median [pxl]           1.0    1.7    1.9    1.6    6.2    1.1    3.4
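The three rows of the table are simple statistics of the per-point reprojection distances in one image. A minimal sketch of how such figures could be computed is given below; the function name and the toy data are illustrative assumptions, not the authors' software.

```python
import numpy as np

def reprojection_stats(P, X, u_measured):
    """Return (max, mean, median) distance between measured and reprojected points.

    P          : 3 x 4 camera matrix
    X          : 4 x m homogeneous scene points
    u_measured : 2 x m measured image points in pixels
    """
    proj = P @ X
    u_hat = proj[:2] / proj[2]                      # back to pixel coordinates
    dist = np.linalg.norm(u_hat - u_measured, axis=0)
    return dist.max(), dist.mean(), np.median(dist)

# Toy data: a canonical camera and three points, each measurement shifted by
# the vector (3, 4), so that every reprojection distance equals 5 pixels.
P = np.hstack([np.eye(3), np.zeros((3, 1))])
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 2.0],
              [1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]])
u_true = X[:2] / X[2]
u_measured = u_true + np.array([[3.0], [4.0]])

stats = reprojection_stats(P, X, u_measured)
print(stats)
```

A median well below the mean, as in most columns of the table, indicates that a few poorly localized correspondences dominate the average error.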



The recovered class $\mathcal{P}$ of the projective reconstruction was used as
the input for Pollefeys' algorithm computing a similarity reconstruction of the
house, see Figure 5.

Fig. 4. Seven input images of the house at Kampa.

Fig. 5. Two views on the recovered 3D model of the house at Kampa with image
texture mapped onto the reconstruction.



“Toy” House. In this experiment, 10 images of a “toy” house model were
taken (see Figure 6) by an uncalibrated photographic camera and then digitized
at a resolution of 2003 x 2952 pixels. Point correspondences were assigned
manually across image triplets.

Fig. 6. Ten input images of the “toy” house and two views of the recovered 3D
Euclidean model.

The following numbers of correspondences were selected:

Image triplet               (1,2,3)  (1,3,4)  (1,4,5)  (1,5,6)  (1,6,7)
Number of correspondences      48       60       54       20       41

Image triplet               (1,7,8)  (1,8,9)  (1,9,10)  (1,10,2)
Number of correspondences      56       41       59        71


The projective reconstruction from all 10 images was performed. The following
table shows the maximal, mean, and median distances between the input and the
reprojected image points:

Image No.              1     2     3     4     5     6     7     8     9     10
Maximal error [pxl]   4.9  10.1   8.2   5.7   9.9   7.9   6.4   5.9   8.9   8.0
Mean error [pxl]      1.4   3.7   2.9   2.0   3.1   2.7   1.7   2.1   1.66  2.0
Median [pxl]          1.1   3.4   2.7   1.7   2.7   2.2   1.5   1.7   1.4   1.8



The Euclidean model (Figure 6) was recovered from the projective one by
assigning 3D Euclidean coordinates to 5 points.

4 Conclusions 

We have presented a new approach to projective reconstruction from a set of
n views for n > 4. The views are grouped in triplets having a reference view
in common. The proposed approach has two important advantages.
Firstly, the existence of a common reference view allows one large
over-determined linear system to be constructed for all homographies in the
projection matrices. Secondly, correspondences are needed only among the
triplets of views containing the reference view. Thus, in this special
arrangement with one reference view, a simultaneous estimate of all the
homographies can be obtained even though not all correspondences are available.
On the other hand, since the reference view has a special position, the method
is not symmetrical with respect to all images.

A Complete Algorithm 

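1. Estimate the epipoles $e^{(b)}$, $b = 2, \dots, n$,
e.g. through the trifocal tensors.

2. Estimate $A^{(b)}$, $b = 2, \dots, n$.

The matrices $A^{(b)}$ are constrained by $T^{(a)}$ and $e^{(b)}$ up to three
free parameters (see Appendix B).

a) Let us select the element of $K_i$ orthogonal (for instance) to
$f = \left[ e^{(2)\top}, \dots, e^{(n)\top} \right]^\top$. Let
$V^\circ \in \mathbb{R}^{3(n-1) \times (3(n-1)-1)}$ be a matrix whose columns
form a basis of the space orthogonal to the vector $f$ and let
$z_i \in \mathbb{R}^{3(n-1)-1}$; then the selected solution can be expressed as

$$ \begin{bmatrix} a_i^{(2)} \\ \vdots \\ a_i^{(n)} \end{bmatrix} = V^\circ z_i, \quad i = 1, 2, 3. \qquad (7) $$

b) Substituting (6) and (7) into (4), we can formulate the linear estimate
directly for $z_i$, $i = 1, 2, 3$:

$$ \text{minimize} \; \left\| D \begin{bmatrix} G V^\circ & 0 & 0 \\ 0 & G V^\circ & 0 \\ 0 & 0 & G V^\circ \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} \right\| \quad \text{subject to} \quad \left\| \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} \right\| = 1, $$

where the matrix $D$ is composed from image data.

c) The columns $a_i^{(2)}, \dots, a_i^{(n)}$, $i = 1, 2, 3$, can be easily
obtained by the back projection (7).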
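The pattern used in steps a) to c) is a homogeneous least-squares problem restricted to the orthogonal complement of a known vector. The following generic sketch shows that pattern on a synthetic matrix; it does not build the actual matrices $D$ and $G$ of the paper, and the helper names are hypothetical.

```python
import numpy as np

def orthogonal_complement(f):
    """Columns form an orthonormal basis of the space orthogonal to f."""
    _, _, Vt = np.linalg.svd(f.reshape(1, -1))
    return Vt[1:].T

def solve_orthogonal_to(M, f):
    """Minimize ||M x|| over x = V z with ||z|| = 1, where V spans f's complement."""
    V = orthogonal_complement(f)
    _, _, Wt = np.linalg.svd(M @ V)
    z = Wt[-1]                      # right singular vector of the smallest singular value
    return V @ z, z                 # back projection x = V z, as in step c)

# Synthetic test matrix with two null directions: f itself and x0 orthogonal to f.
rng = np.random.default_rng(5)
k = 6
f = rng.normal(size=k)
f_hat = f / np.linalg.norm(f)
x0 = rng.normal(size=k)
x0 -= (x0 @ f_hat) * f_hat
x0 /= np.linalg.norm(x0)
proj = np.eye(k) - np.outer(f_hat, f_hat) - np.outer(x0, x0)
M = rng.normal(size=(20, k)) @ proj

x, z = solve_orthogonal_to(M, f)
print(abs(x @ x0), abs(x @ f_hat))
```

Restricting the search to the complement of $f$ removes the known ambiguity, so the smallest singular vector picks out the remaining solution instead of an arbitrary mixture with $f$.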



B From $T^{(a)}$ and $e^{(b)}$ to $A^{(b)}$

Consider the $n-1$ tensors $T^{(a)}$, $a = 1, \dots, n-1$, of the triplets
(1, 2, 3), ..., (1, n, 2). Assume that we have already performed the estimations
of $e^{(j)}$, $j = 2, \dots, n$. Equations (6) can be rewritten to the matrix
form

$$ \begin{bmatrix} \mathrm{vector}\left(T^{(1)\cdot i \cdot}\right) \\ \vdots \\ \mathrm{vector}\left(T^{(n-1)\cdot i \cdot}\right) \end{bmatrix} = G \begin{bmatrix} a_i^{(2)} \\ \vdots \\ a_i^{(n)} \end{bmatrix}, \quad i = 1, 2, 3, \qquad (8) $$

where $a_i^{(b)}$ denotes the $i$-th column of $A^{(b)}$ and $G$ is a
$9(n-1) \times 3(n-1)$ matrix whose non-zero entries are components of the
epipoles $e^{(2)}, \dots, e^{(n)}$.

Since the kernel of $G$ is generated by the vector
$\left[ e^{(2)\top}, \dots, e^{(n)\top} \right]^\top$, for given $T^{(a)}$ and
$e^{(b)}$ there exists a one dimensional set $K_i$ satisfying (8).

Discussion

Richard Hartley: In my algorithm I compute the epipoles, then there’s an 
optional phase in which you iterate over the positions of the epipoles to get a 
global minimum of algebraic error. Could you do the same sort of thing here? 

Tomas Pajdla: Yes, our method can be seen as the first step where we get 
some kind of initial estimate. We could follow this with iterations or proceed 
with bundle adjustment to optimize the real reprojection error in the images. 
We haven’t done this here, but if the initialization is good the bundle adjustment 
will converge. 

Richard Hartley: Just iterating over the position of the epipoles would be a lot 
simpler and faster than bundle adjustment, if you have a lot of matched points. 

Tomas Pajdla: Yes. 

Kalle Astrom: In your house sequence there seemed to be something wrong 
with the texture mapping. 

Tomas Pajdla: Yes, you’re right. The texture mapping is provided by the 
VRML viewer, which can only map textures that were taken frontoparallelly. It 
splits the polygons into triangles, then maps each triangle independently using 
an affine transformation. For non-frontoparallel textures, this introduces discon- 
tinuities across the edges, which are visible for big polygons. It’s a problem with 
the VRML standard and the VRML consortium ought to fix it. 

Note: The texture mapping was fixed in the final version of the paper, by 
recursively subdividing the triangles until the error of the affine texture mapping 
was less than one pixel. 

Marc Pollefeys: In some cases it may be very hard to have a global common 
view, but you might have a lot of views from all around an object. Could you 
use this technique to reconstruct small patches around several central views, 
and then somehow stitch all these patches together? I mean something like the 
Kanade dome, where you could choose some of the cameras as central ones, 
surrounded by others. Could you patch the whole cage structure together? Or 
would you work with trifocal tensors in this case? 

Tomas Pajdla: We only developed the work for a single central view. One 
advantage is that we can estimate the homographies from all the image data. 
But it would probably be possible to generalize it by gluing things together, 
maybe using the trifocal tensor as in other works. I don’t know whether it would
be any better than just using the trifocal tensor; you’d have to try it.

Abstract. This paper generalizes the parameterized image variety approach
to image-based rendering proposed in [8] so it can handle both
points and lines in a unified setting. We show that the set of all images of
a rigid set of m points and n lines observed by a weak perspective camera
forms a six-dimensional variety embedded in $\mathbb{R}^{2(m+n)}$. A
parameterization of this variety by the image positions of three reference
points is constructed via least squares techniques from point and line
correspondences established across a sequence of images. It is used to
synthesize new pictures without any explicit 3D model. Experiments with real
image sequences are presented.



1 Introduction 

The set of all images of m points and n lines can be embedded in a 2(m + n)-
dimensional vector space E, but it forms in fact a low-dimensional subspace V
of E: as will be shown in the next section, V is a variety (i.e., a subspace
defined by polynomial equations) of dimension eight for affine cameras, and an
eleven-dimensional variety for projective cameras. But V is only a
six-dimensional variety of E for weak perspective and full perspective cameras.
We propose to construct an explicit representation of V, the Parameterized
Image Variety (or PIV), from a set of point and line correspondences
established across a sequence of weak perspective or paraperspective images.
The PIV associated with a rigid scene is parameterized by the position of three
image points, and it can be used to synthesize new pictures of this scene from
arbitrary viewpoints, with applications in virtual reality.

Like other recent approaches to image synthesis without explicit 3D models,
our method completely bypasses the estimation of the motion and
structure parameters, and works fully in image space. Previous techniques
exploit the affine or projective structure of images but ignore the Euclidean
constraints associated with real cameras; consequently, as noted in [13], the
synthesized pictures may be subjected to affine or projective deformations. Our
method takes Euclidean constraints into account explicitly and outputs correct
images.

Parameterized image varieties were first introduced in [8] as a technique for
parameterizing the set of images of a fixed set of points. Here we recall the
original method and present a completely new extension to the problem of
parameterizing the set of images of a fixed set of lines (a potential advantage
of lines over points is that they can be localized very accurately in edge
maps). We show how both the point and line PIVs can be integrated in a general
framework for image synthesis without explicit 3D models, and present
preliminary experiments with real images.

1.1 Background 

Recent work in computer graphics and computer vision has
demonstrated the possibility of displaying 3D scenes without explicit 3D
models (image-based rendering). The light field techniques developed by Chen
[3], Gortler, Grzeszczuk, Szeliski and Cohen, and Levoy and Hanrahan are
based on the idea that the set of all visual rays is four-dimensional, and can
thus be characterized from a two-dimensional sample of images of a rigid scene.

In contrast, the methods proposed by Laveau and Faugeras, Seitz and
Dyer, Kutulakos and Vallino [12], and Avidan and Shashua only use a
discrete (and possibly small) set of views among which point correspondences
have been established by feature tracking or conventional stereo matching. These
approaches are related to the classical problem of transfer in photogrammetry:
given the image positions of tie points in a set of reference images and in a new
image, and given the image positions of a ground point in the reference images,
predict the position of that point in the new image.

In the projective case, Laveau and Faugeras have proposed to first estimate
the pair-wise epipolar geometry between the set of reference views, then
reproject the scene points into a new image by specifying the position of the new
optical center in two reference images and the position of four reference points in
the new image. Once the feature points have been reprojected, realistic rendering
is achieved using classical computer graphics techniques such as ray tracing and
texture mapping. Since then, related methods have been proposed by several
authors in both the affine and projective cases. The main drawback of
these techniques is that the synthesized images are in general separated from
the “correct” pictures by arbitrary planar affine or projective transformations
(this is not true for the method proposed by Avidan and Shashua, which
synthesizes correct Euclidean images, but assumes calibrated cameras and actually
estimates the (small) rotation between the cameras used at modeling time).

The approach presented in the rest of this paper generates correct images
by explicitly taking into account the Euclidean constraints associated with real
cameras. In addition, it integrates both point- and line-based image synthesis in a
common framework. We are not aware of any other line-based approach to
image-based rendering, although structure-from-line-motion methods have of
course been proposed in the past, and image-based rendering is close in spirit
to methods for transfer based on the trifocal tensor, which are in principle
applicable to lines.

1.2 The Set of Images of a Rigid Scene 

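Let us first consider an affine camera observing some 3D scene, i.e., let us assume
that the scene is first submitted to a 3D affine transformation and then
orthographically projected onto the image plane of the camera. We denote the
coordinate vector of a scene point $P$ in the world coordinate system by
$\mathbf{P} = (x, y, z)^T$. Letting $\mathbf{p} = (u, v)^T$ denote the
coordinate vector of the projection $p$ of $P$ onto the image plane, the affine
camera model can be written as

$$ \mathbf{p} = \mathcal{M} \mathbf{P} + \mathbf{p}_0 \qquad (1) $$

with

$$ \mathcal{M} = \begin{pmatrix} \mathbf{a}^T \\ \mathbf{b}^T \end{pmatrix} \quad \text{and} \quad \mathbf{p}_0 = \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}. $$

Note that $\mathbf{p}_0$ is the image of the origin of the world coordinate system.

Suppose we observe a fixed set of points $P_i$ ($i = 1, \dots, m$) with
coordinate vectors $\mathbf{P}_i$, and let $\mathbf{p}_i$ denote the coordinate
vectors of the corresponding image points. Writing (1) for all the scene points
yields

$$ \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix} = \begin{pmatrix} \mathbf{x} & \mathbf{y} & \mathbf{z} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{x} & \mathbf{y} & \mathbf{z} & \mathbf{0} & \mathbf{1} \end{pmatrix} \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \\ \mathbf{p}_0 \end{pmatrix}, $$

where

$$ \mathbf{u} = (u_1, \dots, u_m)^T, \quad \mathbf{v} = (v_1, \dots, v_m)^T, \quad \mathbf{1} = (1, \dots, 1)^T, \quad \mathbf{0} = (0, \dots, 0)^T, $$
$$ \mathbf{x} = (x_1, \dots, x_m)^T, \quad \mathbf{y} = (y_1, \dots, y_m)^T, \quad \mathbf{z} = (z_1, \dots, z_m)^T, $$

and it follows that the set of images of m points is an eight-dimensional vector
space $V_p$ embedded in $\mathbb{R}^{2m}$.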
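The dimension count can be confirmed numerically. In the sketch below (an illustration with arbitrary sizes and random data, not taken from the paper), many affine images of the same m points are stacked as vectors of $\mathbb{R}^{2m}$; their span has dimension eight regardless of how many images are generated.

```python
import numpy as np

rng = np.random.default_rng(4)

m, n_images = 10, 50                      # hypothetical sizes
P = rng.normal(size=(m, 3))               # rows are scene points (x, y, z)

images = []
for _ in range(n_images):
    M = rng.normal(size=(2, 3))           # rows a^T and b^T of an affine camera
    p0 = rng.normal(size=2)               # image of the world origin (u0, v0)
    p = P @ M.T + p0                      # m x 2 image points, equation (1)
    images.append(np.concatenate([p[:, 0], p[:, 1]]))   # stacked (u; v)

S = np.stack(images)                      # n_images x 2m
rank = np.linalg.matrix_rank(S)
print(rank)
```

The eight degrees of freedom are exactly the entries of $\mathbf{a}$, $\mathbf{b}$ and $\mathbf{p}_0$; the Euclidean constraints introduced next remove two of them.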

Let us now consider a line $\Delta$ parameterized by its direction
$\mathbf{\Omega}$ and the vector $\mathbf{D}$ joining the origin of the world
coordinate system to its projection onto $\Delta$. We can parameterize the
projection $\delta$ of $\Delta$ onto the image plane by the image vector
$\mathbf{d}$ that joins the origin of the image coordinate system to its
orthogonal projection onto $\delta$. This vector is defined by the two
constraints

$$ \begin{cases} \mathbf{d} \cdot (\mathcal{M} \mathbf{\Omega}) = 0, \\ \mathbf{d} \cdot (\mathcal{M} \mathbf{D} + \mathbf{p}_0) = |\mathbf{d}|^2. \end{cases} \qquad (2) $$

It follows that the set of all affine images of n lines is an eight-dimensional
variety $V_l$ embedded in $\mathbb{R}^{2n}$ and defined by the 2n equations in
2n + 8 unknowns (namely, the coordinates of the vectors $\mathbf{d}_i$
($i = 1, \dots, n$) associated with the n lines and the coordinates of the
vectors $\mathbf{a}$, $\mathbf{b}$ and $\mathbf{p}_0$) obtained by writing (2)
for n lines. More generally, the set of all affine images of m points and n
lines is an eight-dimensional variety V embedded in $\mathbb{R}^{2(m+n)}$.

Let us now suppose that the camera observing the scene has been calibrated
so that image points are represented by their normalized coordinate vectors.
Under orthographic projection, $\mathbf{a}^T$ and $\mathbf{b}^T$ are the first
two rows of a rotation matrix, and it follows that an orthographic camera is an
affine camera with the additional constraints

$$ |\mathbf{a}|^2 = |\mathbf{b}|^2 = 1 \quad \text{and} \quad \mathbf{a} \cdot \mathbf{b} = 0. \qquad (3) $$

Likewise, a weak perspective camera is an affine camera with the constraints

$$ |\mathbf{a}|^2 = |\mathbf{b}|^2 \quad \text{and} \quad \mathbf{a} \cdot \mathbf{b} = 0. \qquad (4) $$

Finally, a paraperspective camera is an affine camera with the constraints

$$ \mathbf{a} \cdot \mathbf{b} = \frac{u_r v_r}{2(1 + u_r^2)} |\mathbf{a}|^2 + \frac{u_r v_r}{2(1 + v_r^2)} |\mathbf{b}|^2 \quad \text{and} \quad (1 + v_r^2) |\mathbf{a}|^2 = (1 + u_r^2) |\mathbf{b}|^2, $$

where $(u_r, v_r)$ denote the coordinates of the image of the reference point
associated with the scene (similar constraints are used in Euclidean shape
and motion recovery).

As shown earlier, the set of affine images of a fixed scene is an
eight-dimensional variety. If we restrict our attention to weak perspective
cameras, the set of images becomes the six-dimensional sub-variety defined by
the additional constraints (4). Similar constraints apply to paraperspective and
true perspective projection, and they also define six-dimensional varieties. We
only detail the weak perspective case in the next three sections; the extension
to the paraperspective case is straightforward [7]. Extending the proposed
approach to the full perspective case would require eliminating three motion
parameters among five quadratic Euclidean constraints, a formidable task in
elimination theory.



2 Parameterized Image Varieties 

We propose a parameterization of the six-dimensional variety formed by the weak
perspective images of m points and n lines in terms of the image positions of
three points in the scene. This parameterization defines the parameterized image
variety (or PIV) associated with the scene. Let us suppose that we observe
three points $A_0$, $A_1$, $A_2$ whose images are not collinear (see Fig. 1). We
can choose (without loss of generality) a Euclidean coordinate system such that
the coordinate vectors of the three points in this system are
$\mathbf{A}_0 = (0, 0, 0)^T$, $\mathbf{A}_1 = (1, 0, 0)^T$,
$\mathbf{A}_2 = (p, q, 0)^T$ (the values of p and q are nonzero but unknown).
These points will be used to parameterize the PIV in the next two sections.

2.1 The Point PIV

This section briefly summarizes the presentation of jS|. Let us consider a point 
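$P$ and its projection $p$ in the image plane, and denote by $\mathbf{P} = (x, y, z)^T$ and $\mathbf{p} = (u, v)^T$
their coordinate vectors. The values of $(x, y, z)$ are of course unknown.
We will also assume that $u_0 = v_0 = 0$ since we can go back to the general case
via an image translation. Applying the affine projection equation to $A_1$, $A_2$ and $P$ yields

$$\mathbf{u} = \mathcal{A}\mathbf{a} \quad\text{and}\quad \mathbf{v} = \mathcal{A}\mathbf{b}, \tag{5}$$

where

$$\mathbf{u} = (u_1, u_2, u)^T, \quad \mathbf{v} = (v_1, v_2, v)^T, \quad \mathcal{A} = \begin{pmatrix} \mathbf{A}_1^T \\ \mathbf{A}_2^T \\ \mathbf{P}^T \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ p & q & 0 \\ x & y & z \end{pmatrix}.$$

In turn, we have $\mathbf{a} = \mathcal{B}\mathbf{u}$ and $\mathbf{b} = \mathcal{B}\mathbf{v}$, where

$$\mathcal{B} \stackrel{\text{def}}{=} \mathcal{A}^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ \lambda & \mu & 0 \\ \alpha/z & \beta/z & 1/z \end{pmatrix} \quad\text{and}\quad \begin{cases} \lambda = -p/q, \\ \mu = 1/q, \\ \alpha = -(x + \lambda y), \\ \beta = -\mu y. \end{cases}$$

Letting $\mathcal{C} \stackrel{\text{def}}{=} z^2\mathcal{B}^T\mathcal{B}$, the weak perspective constraints (4) can now be rewritten as

$$\begin{cases} \mathbf{u}^T\mathcal{C}\mathbf{u} - \mathbf{v}^T\mathcal{C}\mathbf{v} = 0, \\ \mathbf{u}^T\mathcal{C}\mathbf{v} = 0, \end{cases}$$

with

$$\mathcal{C} = \begin{pmatrix} \xi_1 & \xi_2 & \alpha \\ \xi_2 & \xi_3 & \beta \\ \alpha & \beta & 1 \end{pmatrix} \quad\text{and}\quad \begin{cases} \xi_1 = (1 + \lambda^2)z^2 + \alpha^2, \\ \xi_2 = \lambda\mu z^2 + \alpha\beta, \\ \xi_3 = \mu^2 z^2 + \beta^2. \end{cases}$$
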
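This equation defines a pair of linear constraints on the coefficients $\xi_i$ ($i = 1, 2, 3$),
$\alpha$ and $\beta$; they can be rewritten as

$$(\mathbf{d}_1 - \mathbf{d}_2)^T\boldsymbol{\xi} = 0 \quad\text{and}\quad \mathbf{d}^T\boldsymbol{\xi} = 0, \tag{6}$$

where

$$\begin{cases} \mathbf{d}_1 \stackrel{\text{def}}{=} (u_1^2,\ 2u_1u_2,\ u_2^2,\ 2u_1u,\ 2u_2u,\ u^2)^T, \\ \mathbf{d}_2 \stackrel{\text{def}}{=} (v_1^2,\ 2v_1v_2,\ v_2^2,\ 2v_1v,\ 2v_2v,\ v^2)^T, \\ \mathbf{d} \stackrel{\text{def}}{=} (u_1v_1,\ u_1v_2 + u_2v_1,\ u_2v_2,\ u_1v + uv_1,\ u_2v + uv_2,\ uv)^T, \\ \boldsymbol{\xi} \stackrel{\text{def}}{=} (\xi_1, \xi_2, \xi_3, \alpha, \beta, 1)^T. \end{cases}$$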

When the four points $A_0$, $A_1$, $A_2$, and $P$ are rigidly attached to each other,
the five structure coefficients $\xi_1$, $\xi_2$, $\xi_3$, $\alpha$ and $\beta$ are fixed. For a rigid scene
formed by $m$ points, choosing three of the points as a reference triangle and
writing (6) for the remaining ones yields a set of $2m - 6$ quadratic equations in
$2m$ unknowns that define a parameterization of the set of all weak perspective
images of the scene. This is the PIV. Note that the weak perspective constraints
(6) are linear in the five structure coefficients. Thus, given a collection of images
and point correspondences, we can compute these coefficients through linear
least-squares (see jOl for an alternative solution). 
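
As an illustration of this linear estimation step, the following Python sketch stacks the two constraints contributed by each image and solves for $(\xi_1, \xi_2, \xi_3, \alpha, \beta)$ by least squares. It assumes NumPy, image coordinates measured relative to the image of $A_0$, and at least three views in general position; the function names `constraint_rows` and `estimate_structure` are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def constraint_rows(u1, v1, u2, v2, u, v):
    """Return the two rows (d1 - d2) and d contributed by one image."""
    d1 = np.array([u1 * u1, 2 * u1 * u2, u2 * u2, 2 * u1 * u, 2 * u2 * u, u * u])
    d2 = np.array([v1 * v1, 2 * v1 * v2, v2 * v2, 2 * v1 * v, 2 * v2 * v, v * v])
    d = np.array([u1 * v1, u1 * v2 + u2 * v1, u2 * v2,
                  u1 * v + u * v1, u2 * v + u * v2, u * v])
    return np.vstack([d1 - d2, d])

def estimate_structure(observations):
    """Least-squares estimate of (xi1, xi2, xi3, alpha, beta).

    observations: iterable of (u1, v1, u2, v2, u, v), one tuple per image,
    with the image of A0 taken as the origin (an assumption of this sketch).
    """
    rows = np.vstack([constraint_rows(*obs) for obs in observations])
    # The last entry of the coefficient vector is fixed to 1, so its column
    # moves to the right-hand side of the linear system.
    lhs, rhs = rows[:, :5], -rows[:, 5]
    xi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return xi
```

With noisy correspondences the same call returns the least-squares fit; each additional image simply adds two rows to the system.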

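Once the vector $\boldsymbol{\xi}$ has been estimated, we can specify arbitrary image positions
for our three reference points and use (6) to compute $u$ and $v$. A more
convenient form for this equation is obtained by introducing

$$\mathcal{E} \stackrel{\text{def}}{=} \begin{pmatrix} \xi_1 - \alpha^2 & \xi_2 - \alpha\beta \\ \xi_2 - \alpha\beta & \xi_3 - \beta^2 \end{pmatrix} = z^2 \begin{pmatrix} 1 + \lambda^2 & \lambda\mu \\ \lambda\mu & \mu^2 \end{pmatrix}$$

and defining $\mathbf{u}_2 = (u_1, u_2)^T$ and $\mathbf{v}_2 = (v_1, v_2)^T$. This allows us to rewrite (6) as

$$\begin{cases} X^2 - Y^2 + e_1 - e_2 = 0, \\ 2XY + e = 0, \end{cases} \tag{7}$$

where

$$\begin{cases} e_1 = \mathbf{u}_2^T\mathcal{E}\mathbf{u}_2, \\ e_2 = \mathbf{v}_2^T\mathcal{E}\mathbf{v}_2, \\ e = 2\mathbf{u}_2^T\mathcal{E}\mathbf{v}_2, \end{cases} \quad\text{and}\quad \begin{cases} X = u + \alpha u_1 + \beta u_2, \\ Y = v + \alpha v_1 + \beta v_2. \end{cases}$$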


It is easy to show p| that only two of the {X, Y) pairs of solutions of (0) are 
physically correct, and that they can be computed in closed form. The values of
$u$ and $v$ are then trivially obtained.
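
The passage from the quadratic forms in $\mathcal{C}$ to the equations in $X$ and $Y$ is a matter of completing the square. The short derivation below is a sketch that uses only the definitions of $\mathcal{C}$, $\mathcal{E}$, $X$, $Y$, $e_1$, $e_2$ and $e$ given above.

```latex
\begin{align*}
\mathbf{u}^T\mathcal{C}\mathbf{u}
  &= \xi_1 u_1^2 + 2\xi_2 u_1 u_2 + \xi_3 u_2^2 + 2\alpha u_1 u + 2\beta u_2 u + u^2 \\
  &= (u + \alpha u_1 + \beta u_2)^2
     + (\xi_1 - \alpha^2) u_1^2 + 2(\xi_2 - \alpha\beta) u_1 u_2 + (\xi_3 - \beta^2) u_2^2 \\
  &= X^2 + e_1, \\
\mathbf{v}^T\mathcal{C}\mathbf{v} &= Y^2 + e_2, \qquad
\mathbf{u}^T\mathcal{C}\mathbf{v} = XY + \tfrac{1}{2} e .
\end{align*}
```

Subtracting the first two identities gives $X^2 - Y^2 + e_1 - e_2 = 0$, and doubling the third gives $2XY + e = 0$. Equivalently, $(X + iY)^2 = -(e_1 - e_2) - ie$, so the real pairs $(X, Y)$ are the two square roots of a known complex number.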



2.2 The Line PIV 

Line position. Let us now consider a line $\Delta$ and assume that its intersection
with the reference plane spanned by the points $A_0$, $A_1$ and $A_2$ is transversal (see
Fig. 1). Without loss of generality, we can parameterize this line by the affine
coordinates $(\chi_1, \chi_2)$ of the point $Q$ where it meets this plane, expressed in the
basis $(A_0, A_1, A_2)$, i.e.,

$$\mathbf{Q} = \mathbf{A}_0 + \chi_1\mathbf{A}_1 + \chi_2\mathbf{A}_2,$$

and by the coordinate vector $\boldsymbol{\Omega} = (x, y, 1)^T$ of its direction in the Euclidean
world coordinate system.

Let $\delta$ denote the projection of $\Delta$. We can parameterize this line by the
position of the image $q$ of the point $Q$ and the unit coordinate vector $\boldsymbol{\omega} =
(\cos\theta, \sin\theta)^T$ of its direction (see Fig. 2). If we take as before $a_0$ as the origin of
the image plane, and denote by $d$ the distance between $a_0$ and $\delta$, the equation
of $\delta$ is

$$-u\sin\theta + v\cos\theta - d = 0, \tag{8}$$

where $(u, v)$ denote image coordinates. Since the point $Q$ lies in the reference
plane, the affine coordinates of $q$ in the coordinate system $a_0$, $a_1$, $a_2$ are also $\chi_1$
and $\chi_2$, and substituting in (8) yields

$$(u_1\sin\theta - v_1\cos\theta)\chi_1 + (u_2\sin\theta - v_2\cos\theta)\chi_2 + d = 0,$$

which is a linear equation in $\chi_1$ and $\chi_2$. Given several images of the line $\Delta$, we
can thus estimate $\chi_1$ and $\chi_2$ via linear least-squares. These affine coordinates
can then be used to predict the position of $q$ in any new image once $a_0$, $a_1$ and
$a_2$ have been specified.
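
The least-squares estimate of the two affine coordinates can be sketched in a few lines of Python. The sketch assumes NumPy, image coordinates measured relative to $a_0$, and a signed offset $d$ as it appears in (8); the function name `estimate_line_affine_coords` is hypothetical.

```python
import numpy as np

def estimate_line_affine_coords(views):
    """Estimate (chi1, chi2) from several images of the same line.

    views: iterable of (u1, v1, u2, v2, theta, d), one tuple per image, where
    (u1, v1) and (u2, v2) are the images of A1 and A2, theta is the measured
    line orientation and d its signed offset from the origin a0.
    """
    data = np.asarray(list(views), dtype=float)
    u1, v1, u2, v2, theta, d = data.T
    s, c = np.sin(theta), np.cos(theta)
    # One linear equation per image:
    # (u1 s - v1 c) chi1 + (u2 s - v2 c) chi2 = -d
    lhs = np.column_stack([u1 * s - v1 * c, u2 * s - v2 * c])
    chi, *_ = np.linalg.lstsq(lhs, -d, rcond=None)
    return chi
```

Two images in general position already determine the coordinates; more images average out measurement noise.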




Fig. 2. Parameterization of $\delta$.



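Line orientation. Let us now turn to the prediction of the orientation $\theta$ of the
image line. The equations derived in Section 2.1 still apply when we take $\mathbf{P} = \boldsymbol{\Omega}$
and $\mathbf{p} = \rho\boldsymbol{\omega}$, where $\rho$ is an image-dependent scale factor. Note that since the
overall value of $\rho$ is irrelevant, we can take $z = 1$ without loss of generality.

There are two differences with the point case: (a) the parameters $\xi_i$ ($i =
1, 2, 3$), $\alpha$ and $\beta$ are no longer independent since there is no $z$ parameter to take
into account, and (b) the equations in (7) contain terms in $\rho$ that depend on
the image considered since this time

$$X = \rho\cos\theta + \alpha u_1 + \beta u_2 \quad\text{and}\quad Y = \rho\sin\theta + \alpha v_1 + \beta v_2.$$

Substituting these values in (7) and eliminating $\rho$ yields, after some algebraic
manipulation,

$$f_{12} - g_{12}\cos 2\theta - g\sin 2\theta = \sqrt{(e_1 - e_2)^2 + e^2}, \tag{9}$$

with

$$\begin{cases} f_{12} \stackrel{\text{def}}{=} \mathbf{u}_2^T\mathcal{F}\mathbf{u}_2 + \mathbf{v}_2^T\mathcal{F}\mathbf{v}_2, \\ g \stackrel{\text{def}}{=} 2\mathbf{u}_2^T\mathcal{G}\mathbf{v}_2, \\ g_{12} \stackrel{\text{def}}{=} \mathbf{u}_2^T\mathcal{G}\mathbf{u}_2 - \mathbf{v}_2^T\mathcal{G}\mathbf{v}_2, \end{cases} \quad\text{and}\quad \mathcal{F} \stackrel{\text{def}}{=} \begin{pmatrix} \alpha^2 & \alpha\beta \\ \alpha\beta & \beta^2 \end{pmatrix}, \quad \mathcal{G} \stackrel{\text{def}}{=} \begin{pmatrix} \xi_1 & \xi_2 \\ \xi_2 & \xi_3 \end{pmatrix}.$$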
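
The algebraic manipulation leading to (9) can be sketched as follows. This is one possible route, not necessarily the one used by the authors; the shorthand $U_\alpha$, $V_\alpha$ and $k$ is introduced here only for the derivation.

```latex
\begin{align*}
&\text{Let } U_\alpha = \alpha u_1 + \beta u_2,\quad V_\alpha = \alpha v_1 + \beta v_2,
 \quad\text{so that } X = \rho\cos\theta + U_\alpha,\ \ Y = \rho\sin\theta + V_\alpha. \\
&\text{The combination } k = Y\cos\theta - X\sin\theta = V_\alpha\cos\theta - U_\alpha\sin\theta
 \text{ does not depend on } \rho. \\
&\text{For any } X, Y:\quad
 X^2 + Y^2 - \bigl[(X^2 - Y^2)\cos 2\theta + 2XY\sin 2\theta\bigr] = 2k^2 . \\
&\text{The constraints give } X^2 - Y^2 = -(e_1 - e_2),\quad 2XY = -e,\quad
 X^2 + Y^2 = \sqrt{(e_1 - e_2)^2 + e^2}, \text{ hence} \\
&\sqrt{(e_1 - e_2)^2 + e^2} + (e_1 - e_2)\cos 2\theta + e\sin 2\theta
 = (U_\alpha^2 + V_\alpha^2) - (U_\alpha^2 - V_\alpha^2)\cos 2\theta - 2U_\alpha V_\alpha\sin 2\theta . \\
&\text{With } z = 1:\quad U_\alpha^2 + V_\alpha^2 = f_{12},\quad
 U_\alpha^2 - V_\alpha^2 + e_1 - e_2 = g_{12},\quad 2U_\alpha V_\alpha + e = g .
\end{align*}
```

Moving the $\cos 2\theta$ and $\sin 2\theta$ terms to the right-hand side then gives $f_{12} - g_{12}\cos 2\theta - g\sin 2\theta = \sqrt{(e_1 - e_2)^2 + e^2}$.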



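This allows us to construct a minimal set of four structure parameters $\varepsilon_1$,
$\varepsilon_2$, $\varepsilon_3$ and $\Theta$ by introducing $\gamma = \sqrt{\alpha^2 + \beta^2}$ and defining

$$\begin{cases} \varepsilon_1 = (1 + \lambda^2)/\gamma^2, \\ \varepsilon_2 = \lambda\mu/\gamma^2, \\ \varepsilon_3 = \mu^2/\gamma^2, \end{cases} \quad\text{and}\quad \Theta = \operatorname{Arg}(\alpha, \beta).$$

With this notation, (9) becomes

$$(i_1 + i_2) - (h_1 - h_2 + i_1 - i_2)\cos 2\theta - (h + i)\sin 2\theta = \sqrt{(h_1 - h_2)^2 + h^2}, \tag{10}$$

where

$$\begin{cases} h \stackrel{\text{def}}{=} 2\mathbf{u}_2^T\mathcal{H}\mathbf{v}_2, \\ h_1 \stackrel{\text{def}}{=} \mathbf{u}_2^T\mathcal{H}\mathbf{u}_2, \\ h_2 \stackrel{\text{def}}{=} \mathbf{v}_2^T\mathcal{H}\mathbf{v}_2, \\ i \stackrel{\text{def}}{=} \mathbf{u}_2^T\mathcal{I}\mathbf{v}_2, \\ i_1 \stackrel{\text{def}}{=} \tfrac{1}{2}\mathbf{u}_2^T\mathcal{I}\mathbf{u}_2, \\ i_2 \stackrel{\text{def}}{=} \tfrac{1}{2}\mathbf{v}_2^T\mathcal{I}\mathbf{v}_2, \end{cases} \quad\text{and}\quad \mathcal{H} \stackrel{\text{def}}{=} \begin{pmatrix} \varepsilon_1 & \varepsilon_2 \\ \varepsilon_2 & \varepsilon_3 \end{pmatrix}, \quad \mathcal{I} \stackrel{\text{def}}{=} \begin{pmatrix} 1 + \cos 2\Theta & \sin 2\Theta \\ \sin 2\Theta & 1 - \cos 2\Theta \end{pmatrix}.$$
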


Given a set of line correspondences, the four structure parameters $\varepsilon_1$, $\varepsilon_2$,
$\varepsilon_3$ and $\Theta$ can be estimated via non-linear least-squares. At synthesis time, (10)
becomes a trigonometric equation in $2\theta$, with two solutions that are easily computed
in closed form. Each of these solutions only determines $\theta$ up to a $\pi$ ambiguity,
which is immaterial in our case.
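
A closed-form solution of such a trigonometric equation can be sketched as follows. The code solves the generic equation $a\cos\varphi + b\sin\varphi = c$ by writing the left-hand side as a single shifted cosine, then applies it with $\varphi = 2\theta$. The function names and the argument names `k0`, `k1`, `k2`, `rhs` (standing for $i_1 + i_2$, $h_1 - h_2 + i_1 - i_2$, $h + i$ and the square root on the right-hand side) are hypothetical.

```python
import math

def solve_trig(a, b, c):
    """Return the two solutions phi of a*cos(phi) + b*sin(phi) = c.

    An empty list is returned when no real solution exists.
    """
    r = math.hypot(a, b)
    if r == 0.0 or abs(c) > r:
        return []
    phi0 = math.atan2(b, a)        # a*cos(phi) + b*sin(phi) = r*cos(phi - phi0)
    delta = math.acos(c / r)
    return [phi0 + delta, phi0 - delta]

def line_orientations(k0, k1, k2, rhs):
    """Solve k0 - k1*cos(2t) - k2*sin(2t) = rhs for t, reduced modulo pi."""
    return [(phi / 2.0) % math.pi for phi in solve_trig(k1, k2, k0 - rhs)]
```

Reducing each angle modulo $\pi$ reflects the fact that a line orientation and its opposite describe the same image line.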

Note that directly minimizing the error corresponding to (10) is a biased
process. A better method is to minimize $\sum_i (\theta_i - \hat{\theta}_i)^2$, where $\theta_i$ is the line
orientation empirically measured in image number $i$, and $\hat{\theta}_i$ is the orientation
predicted from (10). This is a constrained minimization problem, and the derivatives
of $\hat{\theta}_i$ with respect to the structure parameters are easily computed from
the partial derivatives of (10) with respect to $\theta$ and these parameters.

From lines to line segments. Infinite lines are inappropriate for realistic
graphical display. Thus we must associate with each of them a finite line segment
that can be passed to a rendering module. On the other hand, while lines
can be localized very accurately in the input images, the position of their endpoints
cannot in general be estimated reliably since most edge finders behave
poorly near edge junctions. Additional line breaks can also be introduced by the
program that segments edges into straight lines. Here we present a method for
computing an estimate of the endpoints of a line from its PIV.

Let $R$ denote one of the endpoints of the segment associated with the line
$\Delta$, and let $r$ denote its image. We have $\mathbf{r} - \mathbf{q} = l\boldsymbol{\omega}$ and $\mathbf{R} - \mathbf{Q} = L\boldsymbol{\Omega}$, and we
can once again write $\rho\boldsymbol{\omega} = \mathcal{M}\boldsymbol{\Omega}$, where this time $\rho = l/L$. The (signed) distance
$l$ is known at training time and unknown at synthesis time, while the (signed)
distance $L$ is unknown at training time and known at synthesis time. It is easily
shown that

$$X^2 + Y^2 = \gamma^2\sqrt{(h_1 - h_2)^2 + h^2},$$

and expanding this expression yields a quadratic equation in $\rho$:

$$\rho^2 - 2c\rho + d = 0, \tag{11}$$

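where

$$\begin{cases} c \stackrel{\text{def}}{=} -\gamma\bigl((u_1\cos\Theta + u_2\sin\Theta)\cos\theta + (v_1\cos\Theta + v_2\sin\Theta)\sin\theta\bigr), \\ d \stackrel{\text{def}}{=} \gamma^2\bigl(i_1 + i_2 - \sqrt{(h_1 - h_2)^2 + h^2}\bigr). \end{cases}$$

Note that we can compute $\gamma$ from the vector $(\varepsilon_1, \varepsilon_2, \varepsilon_3)^T$ as

$$\gamma = \sqrt{\varepsilon_3/(\varepsilon_1\varepsilon_3 - \varepsilon_2^2)}.$$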

During training, (11) can be used to estimate $|L|$ from a single image or from
several ones (via non-linear least squares). During image synthesis, we estimate
$|l|$ as $|\rho L|$, and give a sign to $l$ using tracking from a real image.



3 Image Synthesis 

Once the PIV parameters have been estimated, the scene can be rendered from 
a new viewpoint by specifying interactively the image positions of the three 
reference points, and computing the corresponding image positions of all other 
points and lines. To create a shaded picture, we can construct a constrained
Delaunay triangulation of these lines and points whose vertices and edges
will form a superset of the input points and line segments. Texture mapping is then
easily achieved by using the triangulation of one of the input images. This section 
details the main stages of the rendering process. 

3.1 Integration of Point and Line PIVs 

Once the structure parameters associated with all the points and lines have
been computed, these can be used to construct a refined estimate of the parameters
$1 + \lambda^2$, $\lambda\mu$ and $\mu^2$ that are common across all features. Indeed, the
vectors $(\xi_1 - \alpha^2, \xi_2 - \alpha\beta, \xi_3 - \beta^2)^T$ and $(\varepsilon_1, \varepsilon_2, \varepsilon_3)^T$ associated with the various
lines and points all belong to the one-dimensional vector space spanned
by $(1 + \lambda^2, \lambda\mu, \mu^2)^T$. A representative unit vector $(\eta_1, \eta_2, \eta_3)^T$ can be found via
singular value decomposition, and we obtain

$$\begin{cases} 1 + \lambda^2 = \eta_1\eta_3/(\eta_1\eta_3 - \eta_2^2), \\ \lambda\mu = \eta_2\eta_3/(\eta_1\eta_3 - \eta_2^2), \\ \mu^2 = \eta_3^2/(\eta_1\eta_3 - \eta_2^2). \end{cases}$$

Once these common structure parameters have been estimated, a better estimate
of the point PIV can be constructed via linear least squares [3]. In the
case of lines, (10) becomes an equation in $\gamma^2$ and $2\Theta$ only, and better estimates
of these parameters can be computed once again via non-linear least squares.
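
The singular value decomposition step can be illustrated with a short Python sketch. It assumes NumPy and a $k \times 3$ array whose rows are the per-feature vectors described above; the function name `common_structure` is hypothetical.

```python
import numpy as np

def common_structure(vectors):
    """Estimate (1 + lambda^2, lambda*mu, mu^2) from per-feature vectors.

    vectors: (k, 3) array whose rows are each (approximately) proportional
    to (1 + lambda^2, lambda*mu, mu^2).
    """
    stacked = np.asarray(vectors, dtype=float)
    # The dominant right singular vector is the unit vector that best fits
    # the common direction of the rows in the least-squares sense.
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    eta1, eta2, eta3 = vt[0]
    denom = eta1 * eta3 - eta2 ** 2
    # Every term is quadratic in eta, so the arbitrary sign of vt[0] cancels.
    return np.array([eta1 * eta3, eta2 * eta3, eta3 ** 2]) / denom
```

Because the three ratios are homogeneous of degree zero in $(\eta_1, \eta_2, \eta_3)$, neither the sign nor the norm of the representative vector affects the result.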

With the line segments associated to each line and the refined structure 
parameters in hand, we are now in a position to construct a shaded picture. 
As noted before, we can construct a constrained Delaunay triangulation of the 
line segments and points (using, for example, Shewchuk’s Triangle public-domain
software ED]) whose vertices and edges form a superset of the input points and 
line segments. Texture mapping is then easily achieved by using the triangulation 
of one of the input images. 

3.2 Hidden-Surface Removal 

Here we show how traditional z-buffer techniques can be used to perform hidden- 
surface elimination even though no explicit 3D reconstruction is performed. The 
technique is the same as in and it is summarized here for completeness. Let 
$\Pi$ denote the image plane of one of our input images, and $\Pi'$ the image plane of
our synthetic image. To render correctly two points P and Q that project onto 
the same point r' in the synthetic image, we must compare their depths. 




Fig. 3. Using a z-buffer without actual depth values. 



Let $R$ denote the intersection of the viewing ray joining $P$ to $Q$ with the
plane spanned by the reference points $A_0$, $A_1$ and $A_2$, and let $p$, $q$, $r$ denote
the projections of $P$, $Q$ and $R$ into the reference image. Suppose for the time
being that $P$ and $Q$ are two of the points tracked in the input image; it follows
that the positions of $p$ and $q$ are known. The position of $r$ is easily computed by
remarking that its coordinates in the affine basis of $\Pi$ formed by the projections
$a_0$, $a_1$, $a_2$ of the reference points are the same as the coordinates of $R$ in the affine
basis formed by the points $A_0$, $A_1$, $A_2$ in their own plane, and thus are also the
same as the coordinates of $r'$ in the affine basis of $\Pi'$ formed by the projections
$a'_0$, $a'_1$, $a'_2$ of the reference points.
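
The transfer of a reference-plane point between two images through its affine coordinates can be sketched as follows. The code assumes NumPy and 2D pixel coordinates; `affine_coords` and `transfer_point` are hypothetical helper names.

```python
import numpy as np

def affine_coords(a0, a1, a2, r):
    """Coordinates (c1, c2) such that r = a0 + c1*(a1 - a0) + c2*(a2 - a0)."""
    a0, a1, a2, r = (np.asarray(pt, dtype=float) for pt in (a0, a1, a2, r))
    basis = np.column_stack([a1 - a0, a2 - a0])
    return np.linalg.solve(basis, r - a0)

def transfer_point(src_basis, dst_basis, r_src):
    """Map a point of the reference plane from one image to another.

    src_basis, dst_basis: the images (a0, a1, a2) of the three reference
    points in the source and destination images.
    """
    c1, c2 = affine_coords(*src_basis, r_src)
    b0, b1, b2 = (np.asarray(pt, dtype=float) for pt in dst_basis)
    return b0 + c1 * (b1 - b0) + c2 * (b2 - b0)
```

In the setting of this section, the source image would be the synthetic image (where $r'$ is the pixel being rendered) and the destination image the reference input image (where $r$ is needed).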

The ratio of the depths of $P$ and $Q$ relative to the plane $\Pi$ is simply the ratio
$\overline{pr}/\overline{qr}$. Note that deciding which point is actually visible requires orienting the line
supporting the points $p$, $q$, $r$, which is simply the epipolar line associated with
the point $r'$. A coherent orientation should be chosen for all epipolar lines and
all frames. This is easily done, up to a two-fold ambiguity, using the technique
described in jS|. 

3.3 Rendering 

Given an input triangulation, the entire scene can now be rendered as follows: 
(1) pick the correct orientation for the epipolar lines (using one of the point 
correspondences and the previous orientation); (2) compute, for each of the data 
points $P$, the position of $r$ in the reference image and the “depth” $\overline{pr}$, and store
it as its “z” coordinate; (3) render the triangles forming the scene using a
z-buffer algorithm with orthographic projection along the z-axis. Texture mapping
is easily incorporated in the process. It should be noted that this process can 
generate two families of images corresponding to the initial choice of epipolar 
line orientation. The choice can be made by the user during interactive image 
synthesis. 



4 Implementation and Results 

We have implemented the proposed approach and tested it on real data sets. 
The LQBOX data set was kindly provided by Dr. Long Quan from CNRS; the
TOWER and XL1BOX data sets were acquired by the authors in the Computer
Vision and Robotics Laboratory at the Beckman Institute, using a Canon XL1
Digital Camcorder kindly provided by Dr. David Kriegman.

For completeness, we first present results of point PIV experiments. Along
with the above data sets we have used the HOUSE data set kindly provided
by Dr. Carlo Tomasi, the KITCHEN data set kindly provided by the Modeling
by Videotaping Research Group at the Department of Computer Science of
Carnegie Mellon University, and the FLOWER and FACE data sets acquired by
the authors. Note that for both point and line features, we have four variants of
the PIV algorithm, namely, the first and second passes of the weak perspective
and paraperspective algorithms (or W1, P1, W2 and P2 in short). Fig. 4 shows
some quantitative results where we have estimated the PIVs for each data set
using different numbers of images in training. In particular we have used the
first 25%, 50%, 75%, or 100% of the data in training and used the rest of the
data as test images to compute the reprojection errors (except in the last case
where all the data is used in training and testing). Figure 5 shows synthesized
images for novel views using point PIVs.

Fig. 4. Average error in image point reconstruction on real data sets: for each data set, the
bars from left to right represent the W1, P1, W2 and P2 methods. In training, from
top to bottom, the first 25%, 50%, 75% and 100% of the images are used.

Fig. 6 shows the mean errors for the reconstruction of images of line features
for the four different methods, as was done for point features above. Note that the
line position is computed using affine notions only. We have recorded the error
in reconstruction of the line position in pixels and the line direction in degrees.

Fig. 7 shows the line features with their extents reconstructed for the last
frame in the TOWER data, with the first half of the images used for training.
The original lines are plotted as dotted lines.

Finally, Fig. 8 shows view-synthesis results where we have used the line and
point PIVs together. More view synthesis results in the form of movies can be found
at http://www-cvr.ai.uiuc.edu/~ygenc/thesis/index.html.
Fig. 5. Image synthesis for novel views using point features for the FLOWER and
FACE data sets.



5 Discussion 

We have presented an integrated method for image-based rendering from point 
and line correspondences established across image sequences, and demonstrated 
an implementation using real images. A very interesting problem that we plan 
to explore is the construction of better meshes from image sequences. This is 
a difficult issue for any image-based rendering technique that does not attempt 
to estimate the camera motion or the actual scene structure, and it is also a 
very important one in practice since rendering a scene from truly arbitrary
viewpoints requires constructing a mesh covering its whole surface. 
Discussion

Yongduek Seo: In the tower movie I see that there are shadows. Would it be 
possible to change the shadow according to the motion? 

Jean Ponce: That would be nice, but I don’t know how to do it. David Kriegman
did something like that for fixed illumination and Lambertian objects. In
principle you could take this funny linear illumination model and put it in, but
if you wanted to use it for real I think there would be a lot of engineering work 
to do. It’s not clear how much the graphics companies want these things. Yakup 
tells me that building geometric models is not very interesting for them — they 
can just buy a set of scanners and do the job like that. 

Bill Triggs: Following on from the previous talk, your six-dimensional variety 
obviously supports some sort of embedding of Euclidean structure, including the 
Euclidean group motions. So it might be possible to use Lie algebra techniques 
to move yourself around with your joystick. Secondly, you talk about varieties 
but it isn’t clear to me that the global structure is of any use to you — really 
you are looking at just one point on the variety. 

Jean Ponce: I agree on both points. Let me answer the second question first. 
The points are completely independent except when we estimate the structure 
coefficients and put them together. There are two ways it could be better. First, instead
of the three point basis it would be nice to do it some other way. Second, ideally, 
you should take all the data into account at once and we don’t know how yet. 

For the first question on how to use the joystick, yes we could do something 
like that. For example Avidan and Shashua use the trifocal tensor for that, but 
they explicitly put in a representation of the rotation. In our case we didn’t try 
it because we didn’t want either motion or 3D structure. We wanted to work 
only in the image and see what we could do there. Even then, there is still a lot 
of implicit 3D stuff. But yes, if you wanted you could probably go back to some 
Euclidean embedding and make it work. 

Yvan Leclerc: In your experiments, did you manually segment out the tower 
or was it automatic? 

Jean Ponce: We built a simple tracker/segmenter and after that cleaned the 
results by hand. 
Abstract. This paper addresses the problem of motion recovery from image profiles,
in the important case of turntable sequences. No correspondences between
points or lines are used. Symmetry properties of surfaces of revolution are ex- 
ploited to obtain, in a robust and simple way, the image of the rotation axis of the 
sequence and the homography relating epipolar lines. These, together with geo- 
metric constraints for images of rotating objects, are used to obtain epipoles and, 
consequently, the full epipolar geometry of the camera system. This sequential 
approach (image of rotation axis → homography → epipoles) avoids many of the
problems usually found in other algorithms for motion recovery from profiles. In 
particular, the search for the epipoles, by far the most critical step for the estimation 
of the epipolar geometry, is carried out as a one-dimensional optimization prob- 
lem, with a smooth unimodal cost function. The initialization of the parameters is 
trivial in all three stages of the algorithm. After the estimation of the epipolar ge- 
ometry, the motion is recovered using the fixed intrinsic parameters of the camera, 
obtained either from a calibration grid or from self-calibration techniques. Results 
from real data are presented, demonstrating the efficiency and practicality of the 
algorithm. 



1 Introduction 

Points and lines have long been used for the recovery of structure and motion from 
images of 3D objects. Nevertheless, for a smooth surface the predominant feature in the 
image is its profile or apparent contour, defined as the projection of a contour generator 
of the surface. A contour generator corresponds to the set of points on a surface where 
the normal vector to the surface is orthogonal to the rays joining the points in the set and 
the camera center (for details, see [3, 4]). If the surface does not have noticeable texture, 
the profile may actually be the only source of information available for estimating the 
structure of the surface and the motion of the camera. 

The problem of motion recovery from image profiles has been tackled in several 
works. The concept of frontier point, defined as a point on a surface tangent to any 
plane of the pencil of epipolar planes related to a pair of images, was introduced in 
[15]. The idea was further developed in [14], where the frontier point was recognized 
as a fixed point on a surface, created by the intersection of two contour generators. A 
frontier point projects on its associated images as an epipolar tangency. The use of 
frontier points and epipolar tangencies for motion recovery was first shown in [2]. A 
parallax based technique, using a reference planar contour was shown in [1], where 
the images are registered using the reference contour and common tangents are used 
to determine the projections of the frontier point. The techniques described above face
two main difficulties: the likely non-uniqueness of the solution, due to the presence of
local minima, and the unrealistic requirement of having at least 7 corresponding epipolar
tangencies available on each image pair. Better results can be achieved when an affine
approximation is used, as shown in [11]. In this case the problem can be solved when as
few as 4 epipolar tangencies are available, but the application of the method is constrained 
to situations where the affine approximation is valid. 

In the case of circular motion, the envelope of the profiles exhibits symmetry prop- 
erties that greatly simplify this estimation problem. This is an idea well developed for 
orthographic projection. In [15] it is shown that, when the image plane is parallel to 
the axis of rotation, the image of the axis of rotation will be perpendicular to common 
tangents to the images of the profile. The use of bilateral symmetry to obtain the axis 
of rotation was first introduced in [13]. The condition of parallelism between the image 
plane and the axis of rotation was relaxed in [8], but orthographic projection was still 
used. 

In this paper we introduce a novel technique for the estimation of the motion parameters
of turntable sequences. It is based on symmetry properties of the set of apparent
contours generated by the object that undergoes the rotation. In Section 2, a method for 
obtaining the images of the axis of rotation and a special vanishing point is presented. 
The algorithm is simple, efficient and robust, and it does not make direct use of the 
profiles. Therefore, its use can be extended to non-smooth objects, and the quality of 
the results obtained justifies doing so. Section 3 makes use of the previous results to in- 
troduce a parameterization of the fundamental matrix based on the harmonic homology. 
This parameterization allows for the estimation of the epipoles to be carried out as inde- 
pendent one-dimensional searches, avoiding local minima points and greatly decreasing 
the computational complexity of the estimation. These results are used in Section 4, 
which presents the algorithm for motion estimation. Experimental results are shown in 
Section 5, and conclusions and future work are described in Section 6. 



2 Theoretical Background 

An object rotating about a fixed axis sweeps out a surface of revolution [8]. Symmetry 
properties [18, 19] of the image of this surface of revolution can be exploited to estimate 
the parameters of the motion of the object in a simple and elegant way, as will be shown 
next. 

2.1 Symmetry Properties of Images of Surfaces of Revolution 

In the definitions that follow, points and lines will be referred to by their representation 
as vectors in homogeneous coordinates. 

A 2D homography that keeps the pencil of lines through a point $\mathbf{u}$ and the set of
points on a line $\mathbf{l}$ fixed is called a perspective collineation with center $\mathbf{u}$ and axis $\mathbf{l}$. A
homology is a perspective collineation whose center and axis are not incident (otherwise
the perspective collineation is called an elation). Let $\mathbf{a}$ be a point mapped by a homology
onto a point $\mathbf{a}'$. It is easy to show that the center of the homology $\mathbf{u}$, $\mathbf{a}$ and $\mathbf{a}'$ are collinear.
Let $\mathbf{q}_a$ be the line passing through these points, and $\mathbf{v}_a$ the intersection of $\mathbf{q}_a$ and the
axis $\mathbf{l}$. If $\mathbf{a}$ and $\mathbf{a}'$ are harmonic conjugates with respect to $\mathbf{u}$ and $\mathbf{v}_a$, i.e., their cross-ratio
is $-1$, the homology is said to be a harmonic homology (see details in [16, 5]). The
matrix $\mathbf{W}$ representing a harmonic homology with center $\mathbf{u}$ and axis $\mathbf{l}$ in homogeneous
coordinates is given by

$$\mathbf{W} = \mathbf{I} - 2\,\frac{\mathbf{u}\mathbf{l}^T}{\mathbf{u}^T\mathbf{l}}. \tag{1}$$

Henceforth a matrix representing a projective transformation in homogeneous coordi- 
nates will be used in reference to the transformation itself whenever an ambiguity does 
not arise. 
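
The defining properties of this matrix are easy to check numerically. The Python sketch below builds the matrix of (1) and is meant only as an illustration; it assumes NumPy and a center that does not lie on the axis, and the function name `harmonic_homology` is hypothetical.

```python
import numpy as np

def harmonic_homology(center, axis):
    """Matrix I - 2 * (center axis^T) / (center^T axis) in homogeneous coordinates.

    center: homogeneous 3-vector of the center of the homology.
    axis:   homogeneous 3-vector of the axis line (must not contain the center).
    """
    u = np.asarray(center, dtype=float).reshape(3)
    l = np.asarray(axis, dtype=float).reshape(3)
    return np.eye(3) - 2.0 * np.outer(u, l) / (u @ l)
```

Applying the resulting matrix twice gives the identity, every point on the axis is left unchanged, and the center is mapped to itself up to scale. With center and axis both equal to $[1\ 0\ 0]^T$ the matrix reduces to a reflection of the first image coordinate.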

An important property of profiles of surfaces of revolution is stated in the next
theorem:

Theorem 1. The profile of a surface of revolution S viewed by a pinhole camera is 
invariant to the harmonic homology with axis given by the image of the axis of rotation 
of the surface of revolution and center given by the image of the point at infinity in a 
direction orthogonal to a plane that contains the axis of rotation and the camera center. 

The following lemma will be used in the proof of Theorem 1. 

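Lemma 1. Let $\mathbf{T} : \Gamma \to \Gamma$ be a harmonic homology with axis $\mathbf{l}$ and center $\mathbf{u}$ on the
plane $\Gamma$, and let $\mathbf{H} : \Gamma \to \Gamma'$ be a bijective 2D homography. Then, the transformation
$\mathbf{W} = \mathbf{H}\mathbf{T}\mathbf{H}^{-1} : \Gamma' \to \Gamma'$ is a harmonic homology with axis $\mathbf{l}' = \mathbf{H}^{-T}\mathbf{l}$ and center
$\mathbf{u}' = \mathbf{H}\mathbf{u}$.

Proof. Since $\mathbf{H}$ is bijective, $\mathbf{H}^{-1}$ exists. Then

$$\mathbf{W} = \mathbf{H}\left(\mathbf{I} - 2\,\frac{\mathbf{u}\mathbf{l}^T}{\mathbf{u}^T\mathbf{l}}\right)\mathbf{H}^{-1} = \mathbf{I} - 2\,\frac{\mathbf{u}'\mathbf{l}'^T}{\mathbf{u}'^T\mathbf{l}'}, \tag{2}$$

since $\mathbf{u}^T\mathbf{l} = \mathbf{u}'^T\mathbf{l}'$. □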



The following corollary is a trivial consequence of Lemma 1 : 

Corollary 1. Let $\mathbf{T}$, $\mathbf{H}$ and $\mathbf{W}$ be defined as in Lemma 1. The transformation $\mathbf{H}$ is an
isomorphism between the structures $(\mathbf{T}, \Gamma)$ and $(\mathbf{W}, \Gamma')$, i.e., $\forall\gamma \in \Gamma$, $\mathbf{H}\mathbf{T}\gamma = \mathbf{W}\mathbf{H}\gamma$.

An important consequence of Lemma 1 and Corollary 1 is that if a set of points $s$, e.g.,
the profile of a surface of revolution, is invariant to a harmonic homology $\mathbf{T}$, the set
$\hat{s}$ obtained by transforming $s$ by a 2D projective transformation $\mathbf{H}$ is invariant to the
harmonic homology $\mathbf{W} = \mathbf{H}\mathbf{T}\mathbf{H}^{-1}$.

Without loss of generality assume that the axis of rotation of the surface of revolution
$S$ is coincident with the $y$-axis of a right-handed orthogonal coordinate system. Considering
a particular case of Theorem 1 where the pinhole camera $\mathbf{P}$ is given by $\mathbf{P} = [\mathbf{I} \mid \mathbf{t}]$,
where $\mathbf{t} = [0\ 0\ \alpha]^T$ for any $\alpha > 0$, symmetry considerations show that the profile $s$ of
$S$ will be bilaterally symmetric with respect to the image of $y$ (a proof is presented in
Appendix 1), which corresponds to the line $\mathbf{q}_s = [1\ 0\ 0]^T$ in (homogeneous) image
coordinates.

Proof of Theorem 1 (particular case). Since $s$ is bilaterally symmetric about $\mathbf{q}_s$, there
is a transformation $\mathbf{T}$ that maps each point of $s$ onto its symmetrical counterpart, given by

$$\mathbf{T} = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{3}$$

However, as any bilateral symmetry transformation, $\mathbf{T}$ is also a harmonic homology,
with axis $\mathbf{q}_s$ and center $\mathbf{u}_x = [1\ 0\ 0]^T$, since

$$\mathbf{T} = \mathbf{I} - 2\,\frac{\mathbf{u}_x\mathbf{q}_s^T}{\mathbf{u}_x^T\mathbf{q}_s}. \tag{4}$$

The transformation $\mathbf{T}$ maps the set $s$ onto itself (although the points of $s$ are not mapped
onto themselves by $\mathbf{T}$, but onto their symmetrical counterparts), and thus $s$ is invariant to
the harmonic homology $\mathbf{T}$. Since the camera center lies on the $z$-axis of the coordinate
system, the plane that contains the camera center and the axis of rotation is in fact the
$yz$-plane, and the point at infinity orthogonal to the $yz$-plane is $\mathbf{U}_x = [1\ 0\ 0\ 0]^T$, whose
image is $\mathbf{u}_x$. □

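Let $\hat{\mathbf{P}}$ be an arbitrary pinhole camera. The camera $\hat{\mathbf{P}}$ can be obtained by rotating $\mathbf{P}$ about
its optical center by a rotation $\mathbf{R}$ and transforming the image coordinate system of $\mathbf{P}$ by
introducing the intrinsic parameters represented by the matrix $\mathbf{K}$. Let $\mathbf{K}\mathbf{R} = \mathbf{H}$. Thus,
$\hat{\mathbf{P}} = \mathbf{H}[\mathbf{I} \mid \mathbf{t}]$, and the point in space with the image $\mathbf{u}_x$ in $\mathbf{P}$ will project as a point
$\mathbf{v}_x = \mathbf{H}\mathbf{u}_x$ in $\hat{\mathbf{P}}$. Analogously, the line $\mathbf{q}_s$ in $\mathbf{P}$ will correspond to a line $\mathbf{l}_s = \mathbf{H}^{-T}\mathbf{q}_s$
in $\hat{\mathbf{P}}$. It is now possible to derive the proof of Theorem 1 in the general case.

Proof of Theorem 1 (general case). Let $\hat{s}$ be the profile of the surface of revolution $S$
obtained from the camera $\hat{\mathbf{P}}$. Thus, the counter-domain of the bijection $\mathbf{H}$ acting on the
profile $s$ is $\hat{s}$ (or $\mathbf{H}s = \hat{s}$), and, using Lemma 1, the transformation $\mathbf{W} = \mathbf{H}\mathbf{T}\mathbf{H}^{-1}$
is a harmonic homology with center $\mathbf{v}_x = \mathbf{H}\mathbf{u}_x$ and axis $\mathbf{l}_s = \mathbf{H}^{-T}\mathbf{q}_s$. Moreover,
from Corollary 1, $\mathbf{W}\mathbf{H}s = \mathbf{H}\mathbf{T}s$, or $\mathbf{W}\hat{s} = \mathbf{H}\mathbf{T}s$. From the particular case of
Theorem 1 it is known that the profile $s$ will be invariant to the harmonic homology $\mathbf{T}$,
so $\mathbf{W}\hat{s} = \mathbf{H}s = \hat{s}$. □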



The images of a rotating object are the same as the images of a fixed object taken
by a camera rotating around the same axis, or by multiple cameras along that circular
trajectory. Consider any two of such cameras, denoted by $\mathbf{P}$ and $\mathbf{P}'$. If $\mathbf{P}$ and $\mathbf{P}'$ point
towards the axis of rotation, their epipoles $\mathbf{e}$ and $\mathbf{e}'$ will be symmetrical with respect to the
image of the rotation axis, or $\mathbf{e}' = \mathbf{T}\mathbf{e}$, according to Figure 2. In a general situation, the
epipoles will simply be related by the transformation $\mathbf{e}' = \mathbf{W}\mathbf{e}$. It is then straightforward
to show that the corresponding epipolar lines $\mathbf{l}$ and $\mathbf{l}'$ are related by $\mathbf{l}' = \mathbf{W}^{-T}\mathbf{l}$. This
means that the pair of epipoles can be represented with only two parameters once $\mathbf{W}$ is
known. From (2) it can be seen that $\mathbf{W}$ has only four degrees of freedom (dof). Therefore,
the fundamental matrix relating views of an object under circular motion must have only
6 dof, in agreement with [17].

Fig. 1. Lines joining symmetric points with respect to the image of rotation axis $\mathbf{l}_s$ (images are
scaled and translated independently for better observation). (a) The optical axis points directly
towards the rotation axis. (b) The camera is rotated about its optical center by an angle $\rho$ of 20°
in a plane orthogonal to the rotation axis. (c) $\rho$ = 40°. (d) $\rho$ = 60°. (e) Same as (d), but the
vanishing point $\mathbf{v}_x$ is also shown.



3 Parameterization of the Fundamental Matrix 



3.1 Epipolar Geometry under Circular Motion 

The fundamental matrix corresponding to a pair of cameras related by a rotation around 
a fixed axis has a very special parameterization, as shown in [17, 7]. A simpler derivation 
of this result will be shown here. 

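Consider the pair of camera matrices $\mathbf{P}_1$ and $\mathbf{P}_2$, given by

$$\mathbf{P}_1 = [\mathbf{I} \mid \mathbf{t}], \qquad \mathbf{P}_2 = [\mathbf{R}_y(\theta) \mid \mathbf{t}], \tag{5}$$

where

$$\mathbf{t} = [0\ 0\ 1]^T \quad\text{and}\quad \mathbf{R}_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}. \tag{6}$$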
Fig. 2. If the cameras are pointing towards the axis of rotation, the epipoles $\mathbf{e}$ and $\mathbf{e}'$ are symmetric
with respect to the image of the axis of rotation.
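The factor $\sin\theta$ can be eliminated since the fundamental matrix is defined only
up to an arbitrary scale. Assume now that the cameras $\mathbf{P}_1$ and $\mathbf{P}_2$ are transformed by a
rotation $\mathbf{R}$ about their optical centers and the introduction of a set of intrinsic parameters
represented by the matrix $\mathbf{K}$. The new pair of cameras, $\hat{\mathbf{P}}_1$ and $\hat{\mathbf{P}}_2$, is related to $\mathbf{P}_1$ and
$\mathbf{P}_2$ by

$$\hat{\mathbf{P}}_1 = \mathbf{H}\mathbf{P}_1 \quad\text{and}\quad \hat{\mathbf{P}}_2 = \mathbf{H}\mathbf{P}_2, \tag{12}$$

where $\mathbf{H} = \mathbf{K}\mathbf{R}$. The fundamental matrix $\hat{\mathbf{F}}$ of the new pair of cameras $\hat{\mathbf{P}}_1$ and $\hat{\mathbf{P}}_2$ is
given by

$$\hat{\mathbf{F}} = \mathbf{H}^{-T}\mathbf{F}\mathbf{H}^{-1} = \det(\mathbf{H})[\mathbf{v}_x]_\times + \tan\frac{\theta}{2}\left(\mathbf{l}_s\mathbf{l}_h^T + \mathbf{l}_h\mathbf{l}_s^T\right), \tag{13}$$

where $\mathbf{v}_x = \mathbf{H}\mathbf{u}_x$, $\mathbf{l}_h = \mathbf{H}^{-T}\mathbf{q}_h$ and $\mathbf{l}_s = \mathbf{H}^{-T}\mathbf{q}_s$.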



3.2 Parameterization via Planar Harmonic Homology 

The epipole $\hat{e}'$ in the image obtained from the camera $\hat{P}_2$ in (5) is given by

$$\hat{e}' = u_x - \tan\frac{\theta}{2}\,u_z, \tag{14}$$

which can be obtained from (5). The planar harmonic homology $T$ relating the symmetric
elements in the stereo camera system $\hat{P}_1$ and $\hat{P}_2$ (e.g. epipoles and pencils of
epipolar lines) can be parameterized as

$$T = I - 2\,\frac{u_x q_s^T}{u_x^T q_s}. \tag{15}$$

Direct substitution of (14) and (15) in (11) shows that the fundamental matrix can be
parameterized by $\hat{e}'$ and $T$ as:

$$\hat{F} = [\hat{e}']_\times T. \tag{16}$$
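The parameterization in (14)-(16) can be checked numerically. The sketch below is only an illustration: it assumes $u_x = q_s = [1\ 0\ 0]^T$ and $u_z = [0\ 0\ 1]^T$ (values not restated in this section), and it compares $[e']_\times T$ with a fundamental matrix built directly from the canonical cameras of (5). The helper names `skew`, `rot_y` and `canonical_pair` are hypothetical.

```python
import numpy as np

def skew(a):
    """Return [a]_x, the matrix with skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def rot_y(theta):
    """Rotation about the y axis, as in (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def canonical_pair(theta):
    """Epipole (14), homology (15) and fundamental matrix (16)."""
    u_x = np.array([1.0, 0.0, 0.0])   # assumed value of u_x
    u_z = np.array([0.0, 0.0, 1.0])   # assumed value of u_z
    q_s = np.array([1.0, 0.0, 0.0])   # assumed value of q_s
    e2 = u_x - np.tan(theta / 2.0) * u_z
    T = np.eye(3) - 2.0 * np.outer(u_x, q_s) / (u_x @ q_s)
    return e2, T, skew(e2) @ T

theta = np.deg2rad(20.0)
e2, T, F = canonical_pair(theta)

# Reference: for cameras [I|t] and [R|t], x2 ~ R x1 + (t - R t),
# so a valid fundamental matrix is [t - R t]_x R.
t = np.array([0.0, 0.0, 1.0])
R = rot_y(theta)
F_ref = skew(t - R @ t) @ R

# The two matrices agree up to the scale factor -sin(theta).
print(np.allclose(F_ref, -np.sin(theta) * F))
```

The scale factor that relates the two matrices is $-\sin\theta$, which is in line with the remark that a factor involving $\sin\theta$ can be dropped.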

Again, it is easy to show that the result does not depend on the transformation $H$, and
the general result becomes

$$F = [e']_\times W, \quad \text{with} \quad e' = v_x - \tan\frac{\theta}{2}\,v_z. \tag{17}$$

Thus, we have proved that the transformation W corresponds to a plane induced ho- 
mography (see [9]). This means that the registration of the images can be done by using 
W instead of a planar contour as proposed in [1, 6]. It is known that different choices of 
the plane that induces the homography in a plane plus parallax parameterization of the 
fundamental matrix will result in different homographies, although they will all generate 
the same fundamental matrix, since 

$$F = [e']_\times W = [e']_\times [W + e'a^T] \quad \forall\, a \in \mathbb{R}^3. \tag{18}$$
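The invariance stated in (18) follows in one line from the fact that the cross product of a vector with itself vanishes:

$$[e']_\times [W + e'a^T] = [e']_\times W + ([e']_\times e')\,a^T = [e']_\times W, \qquad \text{since } [e']_\times e' = e' \times e' = 0.$$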
Fig. 3. The harmonic homology is a homography induced by the plane that contains the axis of
rotation and bisects the segment joining the camera centers.



The three-parameter family of homographies $[W + e'a^T]$ has a one-to-one correspondence
with the set of planes in $\mathbb{R}^3$. In particular, the homology $W$ relating the cameras $P_1$ and
$P_2$ is induced by a plane S that contains the axis of rotation y and bisects the segment
joining the optical centers of the cameras, as shown in Figure 3.

4 Algorithms for Motion Recovery 

4.1 Estimation of the Harmonic Homology 

Consider an object that undergoes a full rotation around a fixed axis. The envelope $\mathcal{E}$
of the profiles is found by overlapping the image sequence and applying a Canny edge
detector to the resultant image (Figure 4(b)). The homography $W$ is then found by
sampling $N$ points $x_i$ along $\mathcal{E}$ and optimizing the cost function

$$f_W(v_x, l_s) = \sum_{i=1}^{N} \mathrm{dist}(\mathcal{E}, W(v_x, l_s)\,x_i)^2, \tag{19}$$

where $\mathrm{dist}(\mathcal{E}, W(v_x, l_s)\,x_i)$ is the distance between the curve $\mathcal{E}$ and the transformed
sample point $W(v_x, l_s)\,x_i$.

The initialization of the line $l_s$ is trivial, and can be made simply by picking a coarse
approximation for the axis of symmetry of $\mathcal{E}$. This can be done via user intervention or
by automatically locating one or more pairs of corresponding bitangents. In all practical
situations, the camera should be roughly pointing towards the rotation axis, which means
that the point $v_x$ is far away (or even at infinity) and in a direction orthogonal to $l_s$. The
estimation of $W$ is summarized in Algorithm 1.
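A minimal sketch of the cost in (19) is given below. It assumes the form $W = I - 2\,v_x l_s^T / (v_x^T l_s)$ for the harmonic homology, by analogy with (15), and it approximates the envelope by a dense set of edge points, so that dist becomes a nearest-point distance. The function names and the toy data are illustrative only.

```python
import numpy as np

def harmonic_homology(v_x, l_s):
    """W = I - 2 v_x l_s^T / (v_x^T l_s); form assumed by analogy with (15)."""
    v_x = np.asarray(v_x, dtype=float)
    l_s = np.asarray(l_s, dtype=float)
    return np.eye(3) - 2.0 * np.outer(v_x, l_s) / (v_x @ l_s)

def envelope_cost(v_x, l_s, samples, envelope):
    """Cost (19) with the envelope approximated by a point set.

    samples:  (N, 2) array of points sampled along the envelope.
    envelope: (M, 2) array of edge points representing the envelope.
    """
    W = harmonic_homology(v_x, l_s)
    pts = np.column_stack([samples, np.ones(len(samples))]) @ W.T
    pts = pts[:, :2] / pts[:, 2:3]              # back to pixel coordinates
    d = np.linalg.norm(pts[:, None, :] - envelope[None, :, :], axis=2)
    return float(np.sum(d.min(axis=1) ** 2))    # nearest-point distances

# Toy envelope that is mirror-symmetric about the vertical line x = 3.
ys = np.linspace(0.0, 1.0, 5)
left = np.column_stack([3.0 - (1.0 + ys), ys])
right = np.column_stack([3.0 + (1.0 + ys), ys])
env = np.vstack([left, right])

cost_true = envelope_cost([1, 0, 0], [1, 0, -3.0], left, env)   # correct axis
cost_off = envelope_cost([1, 0, 0], [1, 0, -3.5], left, env)    # shifted axis
```

With $v_x$ at infinity in the horizontal direction, $W$ reduces to a reflection about the line $l_s$, so the cost vanishes for the correct axis and grows when the axis is displaced. Any general-purpose minimizer could then be run over the parameters of $v_x$ and $l_s$.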

4.2 Estimation of the Epipoles 

After obtaining a good estimation of W, one can then search for epipolar tangencies 
between pairs of images in the sequence. Epipolar tangencies are important for motion 



Fig. 4. (a) Images 1, 8, 15 and 22 in the sequence of 36 images of a rotating vase. (b) Envelope of
apparent contours produced by overlapping all images in the sequence. (c) Initial guess (dashed
line) and final estimation (solid line) of the image of the rotation axis.



Algorithm 1 Estimation of the harmonic homology $W$.
  overlap the images in the sequence;
  extract the envelope $\mathcal{E}$ of the profiles using a Canny edge detector;
  sample $N$ points $x_i$ along $\mathcal{E}$;
  initialize the axis of symmetry $l_s$ and the vanishing point $v_x$;
  while not converged do
    transfer the points $x_i$ using $W$;
    compute the distances between $\mathcal{E}$ and the transferred points;
    update $l_s$ and $v_x$ to minimize the function in (19);
  end while



estimation from profiles since they are the only correspondences that can be established
between image pairs [2]. To obtain a pair of corresponding epipolar tangencies in two
images, it is necessary to find a line tangent to one profile which is transferred by $W^{-T}$ to
a line tangent to the profile in the other image (see Figure 6). The search for corresponding
tangent lines may be carried out as a one-dimensional optimization problem. The single
parameter is the angle $\alpha$ that defines the orientation of the epipolar line $l$ in the first
image, and the cost function is given by

$$f_\alpha = \mathrm{dist}(W^{-T}l(\alpha),\, l''(\alpha)), \tag{20}$$
Fig. 5. Five images from a single camera under circular motion. The views after rotations of 10°, 20°, 40° and
80° are shown in (b), (e), (h) and (k), and the base image at 0° can be seen in (a), (d), (g) and (j). The
epipolar geometry between image pairs is shown. The overlapping of corresponding pairs can be
seen in (c), (f), (i) and (l). Corresponding epipolar lines intersect at the image of the rotation axis,
and all epipoles lie on a common horizon.



where $\mathrm{dist}(W^{-T}l(\alpha), l''(\alpha))$ is the distance between the transferred line $l' = W^{-T}l$
and a parallel line $l''$ tangent to the profile in the second image. Typical values of $\alpha$ lie
between −0.5 rad and 0.5 rad, or roughly −30° and 30°.

Given a pair of epipolar lines near the top and the bottom of a profile, the epipole 
can be computed as the intersection point of the two epipolar lines, and the fundamental 
Fig. 6. Corresponding pairs of epipolar tangencies near the top and bottom of two images.



Algorithm 2 Estimation of the orientation of the epipolar lines.
  extract the profiles of two adjacent images using a Canny edge detector;
  fit B-splines to the top and the bottom of the profiles;
  initialize $\alpha$;
  while not converged do
    find $l$, $l'$ and $l''$;
    compute the distance between $l'$ and $l''$;
    update $\alpha$ to minimize the function in (20);
  end while



matrix relating the two cameras follows from (17). Using the camera calibration matrix 
obtained either from a calibration grid or from self-calibration techniques, the essential 
matrix can be found. The decomposition of the essential matrix gives the relative motion 
between two cameras. 
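The two geometric steps in this paragraph are short in homogeneous coordinates. The sketch below is illustrative only: the helper names are hypothetical, and the relation $E = K^T F K$ assumes that both views share the same calibration matrix $K$, as is the case for a single fixed camera viewing a turntable.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two pixel points p and q."""
    return np.cross(np.append(p, 1.0), np.append(q, 1.0))

def intersect(l1, l2):
    """Homogeneous intersection point of two lines (their cross product)."""
    return np.cross(l1, l2)

def essential_from_fundamental(F, K):
    """E = K^T F K, assuming the same calibration matrix K for both views."""
    return K.T @ F @ K

# Two epipolar lines (e.g. tangent near the top and the bottom of a profile)
# meet at the epipole.
l_top = line_through([0.0, 0.0], [2.0, 2.0])
l_bottom = line_through([0.0, 2.0], [2.0, 0.0])
e = intersect(l_top, l_bottom)
epipole_xy = e[:2] / e[2]
```

In practice the two lines are nearly parallel when the epipole is far from the image, so keeping the epipole in homogeneous form rather than dividing by its last coordinate is the safer choice.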

4.3 Critical Configurations 

There is a configuration where Algorithm 2 fails. Let $M_1$
and $M_2$ be subsets of two adjacent apparent contours, with $M_1$ and $M_2$ related by the
homography $W$ found in Algorithm 1. Any value of $\alpha$ in Algorithm 2 such that the
resulting epipolar tangencies are in $M_1$ and $M_2$ will minimize the cost function in (20).
The proof follows from observing that if $\alpha$ is the orientation of a putative epipolar line
with corresponding epipolar tangency in $M_1$ in the first contour, the mapping of the
epipolar line tangency via $W$, as required by Algorithm 2, will result in a line tangent
to the second contour, as shown in Figure 7. To overcome this problem it is enough
to choose another contour as the first one of the pair where the problem appeared, and
proceed with the algorithm.
Fig. 7. If the apparent contours are related by the homography $W$, there will be multiple solutions
for the positions of the epipoles. Both pairs $(e_1, e_1')$ and $(e_2, e_2')$ are valid epipoles, consistent
with the transformation $W$ (and thus with $l_s$) and the contours $s$ and $s'$.



The ultimate degenerate configuration occurs when the surface being viewed is a 
surface of revolution (if not completely, at least in the neighbourhood of the frontier 
points), and the axis of rotation of the turntable is coincident with the axis of rotation 
of the surface (or the axis of rotation of the rotationally symmetric neighbourhoods). In 
this case, all the contours are the same, since the contour generator is a fixed curve in 
space, and the substitution of one contour for another will not make any difference. 

5 Implementation and Experimental Results 

The algorithms described in the previous section were tested using a set of 36 images
of a vase placed on a turntable (see Figure 4(a)) rotated by an angle of 10° between
successive snapshots. To obtain $W$, Algorithm 1 was implemented with 40 evenly
spaced sample points along the envelope ($N = 40$). An approximation for the image
of the rotation axis was manually picked by observing the symmetry of the envelope.
This provided an initial guess for $l_s$. The vanishing point $v_x$ was initialized at infinity,
in a direction orthogonal to $l_s$. The cost function (19) was minimized using the BFGS
algorithm [10]. The initial and final configurations can be seen in Figure 4(c).

For the estimation of the motion, Algorithm 2 was applied to pairs of images
to obtain the essential matrix $E$. The camera calibration matrix was obtained using a
calibration grid. The cost function in (20) was minimized using the Golden Section 
method. This optimization problem is rather simple since the cost function is smooth 
and unimodal (see Figure 8). 
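For a smooth, unimodal one-dimensional cost such as (20), a golden-section search needs only function evaluations. The following is a generic textbook sketch, not the authors' implementation; the quadratic used in the usage lines is a stand-in for the real cost, which would require the image profiles.

```python
import math

def golden_section_minimize(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on the interval [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi, about 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while abs(b - a) > tol:
        if fc < fd:
            # The minimum lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Stand-in cost over the typical range of alpha quoted in the text (radians).
alpha_hat = golden_section_minimize(lambda x: (x - 0.2) ** 2, -0.5, 0.5)
```

Each iteration shrinks the bracket by a constant factor of about 0.618 and costs one new function evaluation.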

The direction of the axis of rotation was initialized as that obtained from the first pair 
of images. The quality of each subsequent estimation was checked by comparing the 
direction of the rotation axis computed from the current pair with the average direction 
found for all the previous pairs. If the deviation was greater than 10°, the motion was 
Fig. 8. Plot of the cost function (20) for a pair of images in the sequence. (a)/(b) Cost function for
a pair of corresponding epipolar tangencies near the top/bottom of the profile.



Algorithm 3 Motion estimation.
  estimate motion between IMAGE(1) and IMAGE(2);
  update the direction of the axis of rotation;
  for i = 3 to END do
    j = i - 1;
    while motion is bad do
      estimate motion between IMAGE(j) and IMAGE(i);
      j = j - 1;
    end while
    update the direction of the axis of rotation;
  end for



estimated by using a different combination of images (see Algorithm 3). This
quality-control process is completely automatic.
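The control flow of Algorithm 3 can be sketched as follows. The three callbacks (`estimate_motion`, `is_bad`, `update_axis`) are hypothetical placeholders for the pairwise estimation, the 10° deviation test and the running average of the axis direction. The guard `j > 0`, which stops the fallback at the first image, is an assumption added here and is not part of the pseudocode.

```python
def estimate_sequence(images, estimate_motion, is_bad, update_axis):
    """Chain pairwise motions, falling back to earlier partners for bad pairs."""
    motions = {}
    m = estimate_motion(images[0], images[1])
    motions[(0, 1)] = m
    axis = update_axis(None, m)
    for i in range(2, len(images)):
        j = i - 1
        m = estimate_motion(images[j], images[i])
        # Try earlier partner images until the estimate passes the check.
        while is_bad(m, axis) and j > 0:
            j -= 1
            m = estimate_motion(images[j], images[i])
        motions[(j, i)] = m
        axis = update_axis(axis, m)
    return motions, axis

# Toy run with stub callbacks: the pair (2, 3) is declared bad,
# so image 3 is paired with image 1 instead.
motions, axis = estimate_sequence(
    [0, 1, 2, 3, 4],
    estimate_motion=lambda a, b: (a, b),
    is_bad=lambda m, axis: m == (2, 3),
    update_axis=lambda axis, m: m,
)
```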

The remaining problem was to fix the ratio of the norms of the relative translations.
Since the camera is performing circular motion, it is easy to show that the relative
translations are proportional to $\sin(\theta/2)$, where $\theta$ is the angle of the relative rotation between
the two cameras. The resulting camera configurations are presented in Figure 9(a-c). The
estimated relative angles between adjacent cameras are accurate, as shown in Figure 9(d);
the camera centers are virtually on the same plane and the motion closely follows a
circular path.
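The $\sin(\theta/2)$ dependence is the chord length of a circle. As a short worked step, assume (as an illustration) that the camera centers lie on a circle of radius $r$ about the rotation axis, $C(\varphi) = r\,[\cos\varphi\ \ 0\ \ \sin\varphi]^T$. Then

$$\|C(\varphi + \theta) - C(\varphi)\|^2 = 2r^2(1 - \cos\theta) = 4r^2\sin^2\frac{\theta}{2}, \qquad \text{so} \qquad \|C(\varphi + \theta) - C(\varphi)\| = 2r\left|\sin\frac{\theta}{2}\right|.$$

Since $r$ is common to all pairs, the ratio of two relative translation norms depends only on the corresponding rotation angles.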

6 Conclusions and Future Work 

This paper introduces a new method of motion estimation by using profiles of a rotat- 
ing object. No affine approximation has been used and only minimal information (two 
epipolar tangencies) is required, as long as the object performs a complete rotation. This 
means that the algorithm can be applied in any practical situation involving circular motion.
If more information is available, the estimation problem will be more constrained,

Fig. 9. (a-c) Final configuration of the estimated motion of the cameras, (d) Estimated angles of 
rotation. 



and numerical results can be further improved. By proceeding in a divide-and-conquer 
approach, the difficulties due to initialization and presence of local minima are over- 
come. The search space in the main loop of the algorithm is one-dimensional, making 
the technique highly efficient. 

Some ideas can be explored to further improve the results presented in this work. 
A promising approach is to make simultaneous use of the parameterizations shown 
in (11) and (17). After estimating the position of the epipoles using Algorithm 2, the
horizon line can be found by fitting a line $l_h$ to the epipoles, such that $l_h^T v_x = 0$. This
should be done by using a robust method, such as the Hough transform or RANSAC. Then,
Algorithm 2 can be run again, now with the constraint that all the epipoles must lie 
on the horizon line. This procedure constrains the cameras to exactly follow a circular 
path, and integrates information from all images in the estimation of the horizon. This 
approach has already been proved to produce more accurate results, allowing for high 
quality reconstructions [12]. 
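As a rough illustration of the constrained fit, the sketch below finds the line $l_h$ that passes through $v_x$ and minimizes an algebraic error over the epipoles. It is a plain least-squares version, not a robust one; in a RANSAC scheme it could serve as the model-fitting step on each candidate inlier set. The function name and the toy data are hypothetical.

```python
import numpy as np

def fit_horizon(epipoles, v_x):
    """Least-squares line l_h through homogeneous epipoles with l_h . v_x = 0."""
    E = np.asarray(epipoles, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # balance the rows
    v = np.asarray(v_x, dtype=float)
    v = v / np.linalg.norm(v)
    # Orthonormal basis B (3x2) of the plane orthogonal to v_x, so l_h = B @ y.
    _, _, Vt = np.linalg.svd(v.reshape(1, 3))
    B = Vt[1:].T
    # Minimize ||E B y|| subject to ||y|| = 1: smallest right singular vector.
    _, _, Vt2 = np.linalg.svd(E @ B)
    return B @ Vt2[-1]

# Toy data: epipoles on the line y = 2, with v_x at infinity along x.
epipoles = [[-5.0, 2.0, 1.0], [1.0, 2.0, 1.0], [7.0, 2.0, 1.0]]
l_h = fit_horizon(epipoles, [1.0, 0.0, 0.0])
l_h = l_h / l_h[1]   # normalize for display: expect the line y - 2 = 0
```

Parameterizing $l_h$ in the orthogonal complement of $v_x$ enforces the constraint exactly, leaving a single free direction to estimate.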
Appendix A: Bilateral Symmetry of Images of Surfaces of
Revolution



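Let $S$ be the surface of revolution parameterized as

$$S = \{\, S(\tau, \phi) = [\, f(\tau)\sin\phi \quad g(\tau) \quad -f(\tau)\cos\phi \,]^T \mid (\tau, \phi) \in I_\tau \times I_\phi \,\}, \tag{21}$$

where $f : \mathbb{R} \supset I_\tau \to \mathbb{R}$ is a differentiable map for which $\exists\, a > 0$ such that $0 < f(\tau) < a$
$\forall \tau \in I_\tau$, and $g : \mathbb{R} \supset I_\tau \to \mathbb{R}$ is a differentiable map for which $\exists\, b, c$ such that
$b < g(\tau) < c$ $\forall \tau \in I_\tau$. Also, $\dot{f}^2 + \dot{g}^2 > 0$, where $\dot{f}$ and $\dot{g}$ are the derivatives of the
maps $f$ and $g$. The normal vector at the point $S(\tau, \phi)$ is given by $n = S_\phi \times S_\tau =
f(\tau)\,[\, -\dot{g}\sin\phi \quad \dot{f} \quad \dot{g}\cos\phi \,]^T$, where $S_\chi$ is the partial derivative of $S$ with respect to the
variable $\chi$. Let $P = [\, I \mid t \,]$ be the matrix of a pinhole camera, with $t = [0\ 0\ \alpha]^T$ and
$\alpha > a$.

The profile $s$ of $S$ obtained from $P$ is the projection of the set of points of $S$ where
$(S(\tau, \phi) + t) \cdot n = 0$. This constraint can be expressed as $g(\tau)\dot{f} - \dot{g}f(\tau) + \alpha\dot{g}\cos\phi = 0$,
and for $\tau \in I_\tau$ such that $\dot{g}(\tau) \neq 0$ the resulting expression for the points of $s$ after removing the
dependence on $\phi$ is given by

$$s(\tau) = \begin{bmatrix} \pm f\sqrt{(\alpha\dot{g})^2 - (g\dot{f} - \dot{g}f)^2} \\ \alpha g\dot{g} \\ \alpha^2\dot{g} - f(\dot{g}f - g\dot{f}) \end{bmatrix} \tag{22}$$

$\forall \tau$ such that $|(g\dot{f} - \dot{g}f)/(\alpha\dot{g})| < 1$. Observe that this condition implies that
$\alpha^2\dot{g} - f(\dot{g}f - g\dot{f}) \neq 0$; otherwise one would have $|(g\dot{f} - \dot{g}f)/(\alpha\dot{g})| = |\alpha/f| > 1$. From
(22), one can see that the profile $s$ is bilaterally symmetric about the line $q_s = [1\ 0\ 0]^T$
(observe the sign "$\pm$").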
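The symmetry argument can be checked numerically. The sketch below assumes the parameterization $S(\tau, \phi) = [f\sin\phi,\ g,\ -f\cos\phi]^T$ and the camera $P = [I \mid t]$ with $t = [0\ 0\ \alpha]^T$; the generating curve ($f = 1 + 0.3\sin\tau$, $g = \tau$) and the value $\alpha = 5$ are arbitrary choices made for illustration.

```python
import numpy as np

def profile_points(f, df, g, dg, alpha, tau):
    """Project both contour-generator points of S at parameter tau."""
    # cos(phi) from the tangency constraint g f' - g' f + alpha g' cos(phi) = 0.
    c = (f(tau) * dg(tau) - g(tau) * df(tau)) / (alpha * dg(tau))
    if abs(c) >= 1.0:
        return None  # no contour-generator point at this tau
    out = []
    for s in (np.sqrt(1.0 - c * c), -np.sqrt(1.0 - c * c)):  # sin(phi) = +/-
        X = np.array([f(tau) * s, g(tau), -f(tau) * c + alpha])  # S + t
        out.append(X[:2] / X[2])
    return out

f = lambda t: 1.0 + 0.3 * np.sin(t)
df = lambda t: 0.3 * np.cos(t)
g = lambda t: t
dg = lambda t: 1.0
alpha = 5.0

pts = profile_points(f, df, g, dg, alpha, 0.7)
# The two image points share the same vertical coordinate and have
# opposite horizontal coordinates, i.e. they mirror across the line x = 0.
mirrored = np.isclose(pts[0][0], -pts[1][0]) and np.isclose(pts[0][1], pts[1][1])
```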
Discussion

Jean Ponce: This is all very interesting, but given that the motion is circular and the aim 
is to model an object, why not calibrate the camera, at least internally, and maybe the 
turntable too. After all the camera just sits there on its tripod, and you have the turntable 
there. 

Paulo Mendonça: For the internal parameters I agree, we can calibrate off line and we
don’t even need the turntable to do it. In fact, in the particular experiment I showed, we 
used an off-line internal calibration, not the one extracted from the harmonic homology. 
But for the external parameters and the motion I’d rather not do that because I like the 
flexibility of not having to rely on a calibration grid. 

Andrew Fitzgibbon: To continue that answer, there were several calibrated turntable 
systems at SIGGRAPH, and every one was knocked at least ten times during the day 
and had to be recalibrated. It was impossible to precalibrate a system there. 

My question is: for your object, do the epipolar tangencies have to be far from the 
rotation axis to get a good estimate of the epipole? 

Paulo Mendonça: No, not really. The thing I have to avoid is contours that are related to
one another by the harmonic homology. That doesn’t mean that the epipolar tangencies have
to be far from the rotation axis. With an irregular object that doesn’t have symmetries 
with itself that are close to the symmetry given by the surface of revolution, there’s no 
problem. 

Andrew Fitzgibbon: OK. Also, in your results it appeared that your camera centres 
were not coplanar. 

Paulo Mendonça: Yes, as I said the results are preliminary. I haven’t implemented all
the theory I talked about, such as the bit where the epipoles are constrained to lie on the 
same horizon. What I used is the cloud of points I showed in one of the slides. 

Andrew Fitzgibbon: So you would expect your results to improve. 

Paulo Mendonça: Oh yes. I’ve improved on that result already. What I haven’t done yet
is use the full sequence of images to escape from the algorithm’s critical configurations.
There should be a way of doing that, so that if I can’t use a particular pair of images to
get the angle associated with a camera, I can just jump to another pair.

Yvan Leclerc: I was wondering if the harmonic homology for this kind of skew symmetry
you are talking about would hold even for objects for which the outline is partially
self-occluding from a certain viewpoint? If you think of a dumbbell you get an occluding
edge, for example.

Paulo Mendonça: Yes, we still get the symmetry and it is actually better for the initialization
of the rotation axis. Self-occlusions give non-smooth points in the profile of
the surface of revolution. These are good for initialization because they are symmetry 
points that are both accurate and trustworthy. 

Kalle Åström: Is there a connection between the rotation angles and the epipoles?

Paulo Mendonça: Yes. The angle is directly related to the position of the epipole along
the horizon line. 
Abstract. The prevailing efforts to study the standard formulation of
motion and structure recovery have recently been focused on issues of
sensitivity and robustness of existing techniques. While many cogent
observations have been made and verified experimentally, many statements
do not hold in general settings and make a comparison of existing
techniques difficult. With an ultimate goal of clarifying these issues we study
the main aspects of the problem: the choice of objective functions,
optimization techniques and the sensitivity and robustness issues in the
presence of noise.

We clearly reveal the relationship among different objective functions,
such as “(normalized) epipolar constraints”, “reprojection error” or
“triangulation”, which can all be unified in a new “optimal triangulation”
procedure formulated as a constrained optimization problem. Regardless
of various choices of the objective function, the optimization problems all
inherit the same unknown parameter space, the so called “essential
manifold”, making the new optimization techniques on Riemannian manifolds
directly applicable.

Using these analytical results we provide a clear account of sensitivity
and robustness of the proposed linear and nonlinear optimization
techniques and study the analytical and practical equivalence of different
objective functions. The geometric characterization of critical points of
a function defined on the essential manifold and the simulation results clarify
the difference between the effect of bas relief ambiguity and other types of
local minima, leading to a consistent interpretation of simulation results
over a large range of signal-to-noise ratios and a variety of configurations.



1 Introduction 

While the geometric relationships governing the motion and structure recovery 
problem have been long understood, the robust solutions in the presence of 
noise are still sought. New studies of sensitivity of different algorithms, search 
for intrinsic local minima and new algorithms are still subject of great interest. 

The seminal work of Longuet-Higgins [?] on the characterization of the so
called epipolar constraint enabled the decoupling of the structure and motion

^ This work is supported by ARO under the MURI grant DAAH04-96-1-0341 
problems and led to the development of numerous linear and nonlinear algorithms
for motion estimation (see [1 41712 1| for overviews). The appeal of linear
algorithms which use the epipolar constraint (in the discrete case [21 ITIHIId] and 
in the differential case EESl) is the closed form solution to the problem which, 
in the absence of noise, provides true estimate of the motion. However, a further 
analysis of linear techniques revealed an inherent bias in the translation esti- 
mates m The sensitivity studies of the motion estimation problem have been 
done both in an analytical [XHI and experimental setting m and revealed the 
superiority of the nonlinear optimization schemes over the linear ones. Numerous 
nonlinear optimization schemes differed in the choice of objective functions 12 ill , 
different parameterizations of the unknown parameter space and means 

of initialization of the iterative schemes (e.g. monte-carlo simulations [21 7| . or 

linear techniques jS!)- In most cases, the underlying search space has been pa- 
rameterized for computational convenience instead of being loyal to its intrinsic 
geometric structure. Algebraic manipulation of intrinsic geometric relationships 
typically gave rise to different objective functions, making the comparison of the 
performance of different techniques inappropriate and often obstructing the key 
issues of the problem. The goal of this paper is to evaluate intrinsic difficulties 
of the structure and motion recovery problem in the presence of large levels of 
noise, in terms of intrinsic local minima, bias, sensitivity and robustness. This 
evaluation is done with respect to the choice of objective function and opti- 
mization technique, in the simplified two-view, point-feature scenario. The main 
contributions presented in this paper are summarized briefly below: 



1. We present a new optimal triangulation procedure and show that it can be
formulated as an iterative two step constrained optimization: Motion estimation
is formulated as optimization on the essential manifold and is followed
by additional well conditioned minimization of two Rayleigh quotients for
estimating the structure. The procedure clearly reveals the relationship between
existing objective functions used previously and exhibits superior (provable)
convergence properties. This is possible thanks to the intrinsic nonlinear
search schemes on the essential manifold, utilizing the Riemannian structure
of the unknown parameter space.

2. We demonstrate analytically and by extensive simulations how the choice of
the objective functions and configurations affects the sensitivity and robustness
of the estimates, making a clear distinction between the two. We both
observe and geometrically characterize how the patterns of critical points
of the objective function change with increasing levels of noise for general
configurations. We show the role of linear techniques for initialization and
detection of these incorrect local minima. Furthermore, we utilize the second
order information to characterize the nature of the bas relief ambiguity
and rotation and translation confounding for a special class of “sensitive”
motions/configurations.



Based on analytical and experimental results, we will give a clear profile of the 
performance of different algorithms over a large range of signal-to-noise ratio, 
and under various motion and structure configurations. 
2 Optimization on the Essential Manifold

Suppose the camera motion is given by $(R, S) \in SE(3)$ (the special Euclidean
group) where $R$ is a rotation matrix in $SO(3)$ (the special orthogonal group) and
$S \in \mathbb{R}^3$ is the translation vector. The intrinsic geometric relationship between
two corresponding projections of a single 3D point in two images $p$ and $q$ (in
homogeneous coordinates) then gives the so called epipolar constraint [?]:

$$p^T R\hat{S}q = 0, \tag{1}$$

where $\hat{S} \in \mathbb{R}^{3 \times 3}$ is defined such that $\hat{S}v = S \times v$ for all $v \in \mathbb{R}^3$. The epipolar
constraint decouples the problem of motion recovery from that of structure
recovery. The first part of this paper will be devoted to recovering motion from
directly using this constraint or its variations. In Section 3 we will see how this
constraint has to be adjusted when we consider recovering motion and structure
simultaneously.

The entity of our interest is the matrix $R\hat{S}$ in the epipolar constraint, the so
called essential matrix. The essential manifold is defined to be the space of all
such matrices, denoted by $\mathcal{E} = \{R\hat{S} \mid R \in SO(3), \hat{S} \in so(3)\}$, where $SO(3)$ is the
Lie group of 3 × 3 rotation matrices, and $so(3)$ is the Lie algebra of $SO(3)$, i.e.,
the tangent plane of $SO(3)$ at the identity. $so(3)$ then consists of all 3 × 3
skew-symmetric matrices. The problem of motion recovery is equivalent to optimizing
functions defined on the so called normalized essential manifold:

$$\mathcal{E}_1 = \{R\hat{S} \mid R \in SO(3),\ \hat{S} \in so(3),\ \tfrac{1}{2}\mathrm{tr}(\hat{S}^T\hat{S}) = 1\}.$$

Note that $\tfrac{1}{2}\mathrm{tr}(\hat{S}^T\hat{S}) = S^T S$. In order to formulate the optimization
problem properly, it is crucial to understand the Riemannian structure of the normalized
essential manifold. In our previous work [?] we showed that the space of
essential matrices can be identified with the unit tangent bundle of the Lie group
$SO(3)$, i.e., $T_1(SO(3))$. Furthermore, its Riemannian metric $g$ induced from
the bi-invariant metric on $SO(3)$ is the same as that induced from the Euclidean
metric with $T_1(SO(3))$ naturally embedded in $\mathbb{R}^{3 \times 4}$. $(T_1(SO(3)), g)$ is the
product Riemannian manifold of $(SO(3), g_1)$ and $(S^2, g_2)$ with $g_1$ and $g_2$ canonical
metrics for $SO(3)$ and $S^2$ as Stiefel manifolds. Given this Riemannian
structure of our unknown parameter space, we showed [?] that one can generalize
Edelman et al.’s methods [?] to the product Riemannian manifolds and obtain
intrinsic geometric Newton’s or conjugate gradient algorithms for solving such an
optimization problem. Given the epipolar constraint, the problem of motion
recovery $R, S$ from a given set of image correspondences $p_i, q_i \in \mathbb{R}^3$, $i = 1, \ldots, N$,
in the presence of noise can be naturally formulated as a minimization of the

^ However, the unit tangent bundle $T_1(SO(3))$ is not exactly the normalized essential
manifold $\mathcal{E}_1$. It is a double covering of the normalized essential space $\mathcal{E}_1$, i.e., $\mathcal{E}_1 =
T_1(SO(3))/\mathbb{Z}_2$ (for details see [?]).
following objective function:

$$F(R, S) = \sum_{i=1}^{N} (p_i^T R\hat{S}q_i)^2 \tag{2}$$

for $p_i, q_i \in \mathbb{R}^3$, where $F(R, S)$ is a function defined on $T_1(SO(3)) \cong SO(3) \times S^2$
with $R \in SO(3)$ represented by a 3 × 3 rotation matrix and $S \in S^2$ a vector of
unit length in $\mathbb{R}^3$. Due to the lack of space, below we present only a summary
of Newton’s algorithm for optimization of the above objective function on
the essential manifold. Please refer to [?] for more details on this particular
objective function and to [?] for the details of Newton’s or other conjugate
gradient algorithms for general Stiefel or Grassmann manifolds.
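To make the notation concrete, the sketch below builds $\hat{S}$, checks the identity $\tfrac{1}{2}\mathrm{tr}(\hat{S}^T\hat{S}) = S^TS$, and evaluates the objective (2) on synthetic noise-free correspondences. The synthetic geometry ($X_p = R(X_q - S)$) is an assumption chosen so that $p^T R\hat{S}q = 0$ holds exactly; it is not taken from the paper, and all names are illustrative.

```python
import numpy as np

def hat(s):
    """Skew-symmetric matrix with hat(s) @ v == np.cross(s, v)."""
    return np.array([[0.0, -s[2], s[1]],
                     [s[2], 0.0, -s[0]],
                     [-s[1], s[0], 0.0]])

def epipolar_objective(R, S, p, q):
    """F(R, S) = sum_i (p_i^T R hat(S) q_i)^2 for (N, 3) arrays p and q."""
    E = R @ hat(S)
    r = np.einsum('ij,jk,ik->i', p, E, q)
    return float(np.sum(r ** 2))

rng = np.random.default_rng(0)
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
S = np.array([1.0, 0.5, 0.2])
S = S / np.linalg.norm(S)              # unit-length translation, S in S^2

# Synthetic points: X_p = R (X_q - S), so R^T p, S and q are coplanar.
X_q = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 6.0], size=(20, 3))
X_p = (X_q - S) @ R.T
q = X_q / X_q[:, 2:3]
p = X_p / X_p[:, 2:3]

F_true = epipolar_objective(R, S, p, q)          # zero up to rounding
F_wrong = epipolar_objective(np.eye(3), S, p, q)  # wrong rotation
```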

Riemannian Newton’s algorithm for minimizing $F(R, S)$:

1. At the point $(R, S)$,

   - Compute the gradient $G = (F_R - RF_R^T R,\ F_S - SF_S^T S)$,

   - Compute $\Delta = -\mathrm{Hess}^{-1}G$.

2. Move $(R, S)$ in the direction $\Delta$ along the geodesic to $(\exp(R, \Delta_1), \exp(S, \Delta_2))$.

3. Repeat if $\|G\| > \epsilon$ for pre-determined $\epsilon > 0$.

$F_R$ ($F_S$) is the derivative of the objective function $F(R, S)$ with respect to its
parameters.

The basic ingredients of the algorithm are the computation of the gradient
and Hessian, whose explicit formulas can be found in [?]. These formulas can
be alternatively obtained by directly using the explicit expression of geodesics
on this manifold. On $SO(3)$, the formula for the geodesic at $R$ in the direction
$\Delta_1 \in T_R(SO(3)) = R_*(so(3))$ is $R(t) = \exp(R, \Delta_1 t) = R\exp(\hat{\omega}t) = R(I + \hat{\omega}\sin t +
\hat{\omega}^2(1 - \cos t))$, where $t \in \mathbb{R}$, $\hat{\omega} = R^T\Delta_1 \in so(3)$. The last equation is called
Rodrigues’ formula (see [16]). $S^2$ (as a Stiefel manifold) also has a very simple
expression of geodesics. At the point $S$ along the direction $\Delta_2 \in T_S(S^2)$ the
geodesic is given by $S(t) = \exp(S, \Delta_2 t) = S\cos\sigma t + U\sin\sigma t$, where $\sigma = \|\Delta_2\|$
and $U = \Delta_2/\sigma$; then $S^T U = 0$ since $S^T\Delta_2 = 0$. Using these formulae for
geodesics, we can calculate the first and second derivatives of $F(R, S)$ in the
direction $\Delta = (\Delta_1, \Delta_2) \in T_R(SO(3)) \times T_S(S^2)$. The explicit formula for the
Hessian obtained in this manner plays an important role for sensitivity analysis
of the motion estimation, as we will point out in the second part of the
paper. Furthermore, using this formula, we have shown that the conditions
when the Hessian is guaranteed non-degenerate are the same as the conditions
for the linear 8-point algorithm having a unique solution; whence Newton’s
algorithm has a quadratic rate of convergence.
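The two geodesic formulas translate directly into code. The sketch below is an illustration under one stated assumption: the closed form $I + \hat{\omega}\sin t + \hat{\omega}^2(1 - \cos t)$ is used as written, which holds when $\omega$ is a unit vector. Function names are hypothetical.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix with hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def geodesic_so3(R, omega, t):
    """R(t) = R exp(hat(omega) t) via Rodrigues' formula (unit-norm omega)."""
    W = hat(omega)
    return R @ (np.eye(3) + W * np.sin(t) + W @ W * (1.0 - np.cos(t)))

def geodesic_s2(S, delta, t):
    """S(t) = S cos(sigma t) + U sin(sigma t), sigma = ||delta||, U = delta / sigma."""
    sigma = np.linalg.norm(delta)
    if sigma == 0.0:
        return S.copy()
    U = delta / sigma
    return S * np.cos(sigma * t) + U * np.sin(sigma * t)

# Moving from the identity along the y axis gives a rotation about y.
R_t = geodesic_so3(np.eye(3), np.array([0.0, 1.0, 0.0]), 0.3)

# Moving on the sphere from the north pole along a tangent direction.
S0 = np.array([0.0, 0.0, 1.0])
delta = np.array([0.5, 0.0, 0.0])      # tangent at S0, since S0 . delta = 0
S_t = geodesic_s2(S0, delta, 1.2)
```

Both updates stay on their manifolds by construction: `R_t` remains a rotation and `S_t` remains a unit vector, so no re-normalization step is needed after a move.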

2.1 Minimizing Normalized Epipolar Constraints 

Although the epipolar constraint (1) gives the only necessary (depth independent)
condition that image pairs have to satisfy, motion estimates obtained from
minimizing the objective function (2) are not necessarily statistically or geometrically
optimal for the commonly used noise model of image correspondences. In
general, in order to get less biased estimates, we need to normalize (or weight)
the epipolar constraints properly, which was initially observed in [22]. In
this section, we will give a brief account of these normalized versions of epipolar
constraints. In the perspective projection case, coordinates of image points $p$
and $q$ are of the form $p = (p^1, p^2, 1)^T \in \mathbb{R}^3$ and $q = (q^1, q^2, 1)^T \in \mathbb{R}^3$.
Suppose that the actual measured image coordinates of $N$ pairs of image points
are $p_i = \tilde{p}_i + x_i$, $q_i = \tilde{q}_i + y_i$ for $i = 1, \ldots, N$, where $\tilde{p}_i$ and $\tilde{q}_i$ are ideal
(noise free) image coordinates, $x_i = (x_i^1, x_i^2, 0)^T \in \mathbb{R}^3$, $y_i = (y_i^1, y_i^2, 0)^T \in \mathbb{R}^3$
and $x_i^1, x_i^2, y_i^1, y_i^2$ are independent Gaussian random variables of identical
distribution $N(0, \sigma^2)$. Substituting $p_i$ and $q_i$ into the epipolar constraint (1), we
obtain:

$$p_i^T R\hat{S}q_i = x_i^T R\hat{S}\tilde{q}_i + \tilde{p}_i^T R\hat{S}y_i + x_i^T R\hat{S}y_i.$$

Since the image coordinates $p_i$ and $q_i$ usually are of much larger magnitude than $x_i$ and $y_i$,
one can omit the last term in the equation above. Then $p_i^T R\hat{S}q_i$ are independent
random variables approximately of Gaussian distribution $N(0, \sigma^2(\|\hat{e}_3 R\hat{S}q_i\|^2 +
\|p_i^T R\hat{S}\hat{e}_3^T\|^2))$, where $e_3 = (0, 0, 1)^T \in \mathbb{R}^3$. If we assume the a priori distribution
of the motion $(R, S)$ is uniform, the maximum a posteriori (MAP) estimate of
$(R, S)$ is then the global minimum of the objective function:



$$F_s(R, S) = \sum_{i=1}^{N} \frac{(p_i^T R\hat{S}q_i)^2}{\|\hat{e}_3 R\hat{S}q_i\|^2 + \|p_i^T R\hat{S}\hat{e}_3^T\|^2} \tag{3}$$



for $p_i, q_i \in \mathbb{R}^3$, $(R, S) \in SO(3) \times S^2$. We here use $F_s$ to denote the statistically
normalized objective function associated with the epipolar constraint. This
objective function is also referred to in the literature under the name gradient
criteria or epipolar improvement. Therefore, we have $(R, S)_{MAP} \approx \arg\min F_s(R, S)$.
Note that in the noise free case, $F_s$ achieves zero, just like the unnormalized
objective function $F$ of equation (2). Asymptotically, MAP estimates approach the
unbiased minimum mean square estimates (MMSE). So, in general, the MAP
estimates give less biased estimates than the unnormalized objective function
$F$. Note that $F_s$ is still a function defined on the manifold $SO(3) \times S^2$. Another
commonly used criterion to recover motion is to minimize the geometric distances
between image points and corresponding epipolar lines. This objective function
is given as:



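$$F_g(R, S) = \sum_{i=1}^{N} \frac{(p_i^T R\hat{S}q_i)^2}{\|\hat{e}_3 R\hat{S}q_i\|^2} + \frac{(p_i^T R\hat{S}q_i)^2}{\|p_i^T R\hat{S}\hat{e}_3^T\|^2} \tag{4}$$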



for $p_i, q_i \in \mathbb{R}^3$, $(R, S) \in SO(3) \times S^2$. We here use $F_g$ to denote this geometrically
normalized objective function. Notice that, similar to $F$ and $F_s$, $F_g$ is also a
function defined on the essential manifold and can be minimized using the given
Newton’s algorithm. As we know from the differential case [?], the normalization
has no effect when the translational motion is in the image plane, i.e., the
unnormalized and normalized objective functions are in fact equivalent. For the
discrete case, we have a similar claim. Therefore in such a case the normalization
will have very little effect on motion estimation, as will be verified by simulation.

^ The spherical projection case is similar and is omitted for simplicity.
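Both normalized objectives can be evaluated with a few array operations. The sketch below follows (3) and (4) as reconstructed from the surrounding definitions; the function names, the synthetic geometry and the noise level are illustrative assumptions.

```python
import numpy as np

def hat(s):
    """Skew-symmetric matrix with hat(s) @ v == np.cross(s, v)."""
    return np.array([[0.0, -s[2], s[1]],
                     [s[2], 0.0, -s[0]],
                     [-s[1], s[0], 0.0]])

def normalized_objectives(R, S, p, q):
    """Return (F_s, F_g) of (3) and (4) for (N, 3) homogeneous image points."""
    E = R @ hat(S)
    e3_hat = hat(np.array([0.0, 0.0, 1.0]))
    r = np.einsum('ij,jk,ik->i', p, E, q)            # p_i^T R hat(S) q_i
    a = np.sum((q @ E.T @ e3_hat.T) ** 2, axis=1)    # ||hat(e3) R hat(S) q_i||^2
    b = np.sum((p @ E @ e3_hat.T) ** 2, axis=1)      # ||p_i^T R hat(S) hat(e3)^T||^2
    F_s = np.sum(r ** 2 / (a + b))
    F_g = np.sum(r ** 2 / a + r ** 2 / b)
    return float(F_s), float(F_g)

rng = np.random.default_rng(1)
R = np.eye(3)
S = np.array([1.0, 0.0, 0.0])
X_q = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 6.0], size=(30, 3))
X_p = (X_q - S) @ R.T                  # synthetic geometry with p^T R hat(S) q = 0
q = X_q / X_q[:, 2:3]
p = X_p / X_p[:, 2:3]

# Add Gaussian noise to the first two image coordinates only.
noise = 1e-3
p_noisy = p + np.column_stack([rng.normal(0, noise, (30, 2)), np.zeros(30)])
q_noisy = q + np.column_stack([rng.normal(0, noise, (30, 2)), np.zeros(30)])

F_s, F_g = normalized_objectives(R, S, p_noisy, q_noisy)
```

Because $1/a + 1/b \geq 4/(a + b)$ for positive $a$ and $b$, each term of $F_g$ is at least four times the corresponding term of $F_s$, so the two criteria differ in scale as well as in weighting.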



3 Optimal Triangulation 



Note that, in the presence of noise, for the motion $(R, S)$ recovered from
minimizing the unnormalized or normalized objective functions $F$, $F_s$ or $F_g$, the value
of the objective functions is not necessarily zero. Consequently, if one directly
uses $p_i$ and $q_i$ to recover the 3D location of the point to which the two images $p_i$
and $q_i$ correspond, the two rays corresponding to $p_i$ and $q_i$ may not be coplanar,
hence may not intersect at one 3D point. Also, when we derived the normalized
epipolar constraint $F_s$, we ignored the second order terms. Therefore, rigorously
speaking, it does not give the exact MAP estimates. Under the assumption of
a Gaussian noise model, in order to obtain the optimal (MAP) estimates of camera
motion and a consistent 3D structure reconstruction, in principle we need to
solve the following optimal triangulation problem: Seek camera motion $(R, S)$
and points $\tilde{p}_i, \tilde{q}_i \in \mathbb{R}^3$ on the image plane such that they minimize the distance
from $p_i$ and $q_i$:

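$$F_t(R,S,\tilde{p}_i,\tilde{q}_i) = \sum_{i=1}^{N} \|\tilde{p}_i - p_i\|^2 + \|\tilde{q}_i - q_i\|^2 \qquad (5)$$

subject to the conditions: $\tilde{p}_i^T R\hat{S}\tilde{q}_i = 0$, $\tilde{p}_i^T e_3 = 1$, $\tilde{q}_i^T e_3 = 1$ for $i = 1, \ldots, N$.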
We use $F_t$ to denote the objective function for triangulation. This objective
function is also referred to in the literature as the reprojection error. Unlike
[?], we do not assume a known essential matrix $R\hat{S}$. Instead we seek
$\tilde{p}_i, \tilde{q}_i$ and (R, S) which minimize the objective function
$F_t$ given by (5). The objective function $F_t$ then implicitly depends on the
variables (R, S) through the constraints. Clearly, the optimal solution to this
problem is exactly equivalent to the optimal MAP estimates of both motion and
structure. Using Lagrange multipliers, we can convert the minimization problem
to an unconstrained one:

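$$\min_{R,S,\tilde{p}_i,\tilde{q}_i} \sum_{i=1}^{N} \|\tilde{p}_i - p_i\|^2 + \|\tilde{q}_i - q_i\|^2 + \lambda_i \tilde{p}_i^T R\hat{S}\tilde{q}_i + \beta_i(\tilde{p}_i^T e_3 - 1) + \gamma_i(\tilde{q}_i^T e_3 - 1).$$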

The necessary conditions for minima of this objective function are: 

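$$2(\tilde{p}_i - p_i) + \lambda_i R\hat{S}\tilde{q}_i + \beta_i e_3 = 0 \qquad (6)$$

$$2(\tilde{q}_i - q_i) + \lambda_i \hat{S}^T R^T \tilde{p}_i + \gamma_i e_3 = 0 \qquad (7)$$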



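From the necessary conditions we get $\tilde{p}_i$, $\tilde{q}_i$. Substituting
these, together with the resulting expression for $\lambda_i$, back into $F_t$
we get:

$$F_t(R,S,\tilde{p}_i,\tilde{q}_i) = \sum_{i=1}^{N} \frac{(p_i^T R\hat{S}\tilde{q}_i + \tilde{p}_i^T R\hat{S} q_i)^2}{\|\hat{e}_3 R\hat{S}\tilde{q}_i\|^2 + \|\tilde{p}_i^T R\hat{S}\hat{e}_3^T\|^2} \qquad (8)$$

and, alternatively, using a different expression for $\lambda_i$, we get:

$$F_t(R,S,\tilde{p}_i,\tilde{q}_i) = \sum_{i=1}^{N} \frac{(p_i^T R\hat{S}\tilde{q}_i)^2}{\|\hat{e}_3 R\hat{S}\tilde{q}_i\|^2} + \frac{(\tilde{p}_i^T R\hat{S} q_i)^2}{\|\tilde{p}_i^T R\hat{S}\hat{e}_3^T\|^2} \qquad (9)$$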
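The step from the necessary conditions (6) and (7) to the two closed forms of $F_t$ can be made explicit. The following is a sketch of that derivation, assuming that $\hat{e}_3$ is the skew-symmetric matrix of $e_3 = (0,0,1)^T$, so that $\hat{e}_3^T\hat{e}_3 = \mathrm{diag}(1,1,0)$ and $\|\hat{e}_3 v\|^2 = v_1^2 + v_2^2$.

```latex
% Multiply (6) and (7) by \hat{e}_3^T\hat{e}_3. The third components of
% \tilde{p}_i - p_i and \tilde{q}_i - q_i vanish and \hat{e}_3^T\hat{e}_3 e_3 = 0, so
\begin{align*}
\tilde{p}_i &= p_i - \tfrac{1}{2}\lambda_i\,\hat{e}_3^T\hat{e}_3 R\hat{S}\tilde{q}_i, &
\tilde{q}_i &= q_i - \tfrac{1}{2}\lambda_i\,\hat{e}_3^T\hat{e}_3 \hat{S}^T R^T\tilde{p}_i .
\end{align*}
% Insert these into the constraint \tilde{p}_i^T R\hat{S}\tilde{q}_i = 0:
\begin{align*}
p_i^T R\hat{S}\tilde{q}_i &= \tfrac{1}{2}\lambda_i \|\hat{e}_3 R\hat{S}\tilde{q}_i\|^2, &
\tilde{p}_i^T R\hat{S} q_i &= \tfrac{1}{2}\lambda_i \|\tilde{p}_i^T R\hat{S}\hat{e}_3^T\|^2 .
\end{align*}
% Adding the two identities gives the multiplier that leads to (8):
\begin{equation*}
\lambda_i = \frac{2\,(p_i^T R\hat{S}\tilde{q}_i + \tilde{p}_i^T R\hat{S} q_i)}
                 {\|\hat{e}_3 R\hat{S}\tilde{q}_i\|^2 + \|\tilde{p}_i^T R\hat{S}\hat{e}_3^T\|^2},
\end{equation*}
% while the residual of each point pair is
\begin{equation*}
\|\tilde{p}_i - p_i\|^2 + \|\tilde{q}_i - q_i\|^2
  = \tfrac{1}{4}\lambda_i^2 \left( \|\hat{e}_3 R\hat{S}\tilde{q}_i\|^2
  + \|\tilde{p}_i^T R\hat{S}\hat{e}_3^T\|^2 \right).
\end{equation*}
% Substituting \lambda_i from the sum gives (8); using each of the two
% identities separately for its own term gives (9).
```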



Geometrically, both expressions of $F_t$ are the distances from the image points
$p_i$ and $q_i$ to the epipolar lines specified by $\tilde{p}_i$, $\tilde{q}_i$
and (R, S). Equations (8) and (9) give explicit formulae for the residue
$\|\tilde{p}_i - p_i\|^2 + \|\tilde{q}_i - q_i\|^2$ when $p_i, q_i$ are
triangulated by $\tilde{p}_i, \tilde{q}_i$. Note that the terms in $F_t$ are
normalized crossed epipolar constraints between $p_i$ and $\tilde{q}_i$ or
between $\tilde{p}_i$ and $q_i$. These expressions of $F_t$ can be further used
to solve for (R, S) which minimizes $F_t$. This leads to the following iterative
scheme for obtaining optimal estimates of both motion and structure, without
explicitly introducing scale factors (or depths) of the 3D points.

Optimal Triangulation Algorithm Outline: The procedure for minimizing
$F_t$ can be outlined as follows:

1. Initialize $\tilde{p}_i^*(R,S), \tilde{q}_i^*(R,S)$ as $p_i, q_i$.

2. Motion: Update (R, S) by minimizing
$F_t^*(R,S) = F_t(R,S,\tilde{p}_i^*(R,S),\tilde{q}_i^*(R,S))$ given by (8) or (9)
as a function defined on the manifold $SO(3) \times S^2$.

3. Structure (Triangulation): Solve for $\tilde{p}_i^*(R,S)$ and
$\tilde{q}_i^*(R,S)$ which minimize the objective function $F_t$ with respect to
the (R, S) computed in the previous step.

4. Go back to step 2 until the updates are small enough.
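The control flow of the outline above can be sketched as a plain alternation loop. This is only a skeleton under stated assumptions: `motion_step` and `structure_step` are hypothetical callables standing in for the Riemannian motion update and the per-pair triangulation, and the size of an update is measured here by simple matrix and vector norms, which the text does not specify.

```python
import numpy as np

def alternate(P, Q, R0, S0, motion_step, structure_step, tol=1e-8, max_iter=50):
    """Alternate between motion and structure updates until the motion settles."""
    R, S = R0, S0
    P_t, Q_t = P.copy(), Q.copy()                     # step 1: start from the measurements
    for it in range(1, max_iter + 1):
        R_new, S_new = motion_step(R, S, P, Q, P_t, Q_t)   # step 2: motion update
        P_t, Q_t = structure_step(R_new, S_new, P, Q)      # step 3: triangulation
        change = np.linalg.norm(R_new - R) + np.linalg.norm(S_new - S)
        R, S = R_new, S_new
        if change < tol:                              # step 4: stop on small updates
            break
    return R, S, P_t, Q_t, it

# Toy usage with placeholder steps that jump straight to a fixed motion.
target_R, target_S = np.eye(3), np.array([1.0, 0.0, 0.0])
R, S, P_t, Q_t, n_iter = alternate(
    P=np.ones((2, 3)), Q=np.ones((2, 3)),
    R0=np.eye(3), S0=np.array([0.0, 0.0, 1.0]),
    motion_step=lambda R, S, P, Q, P_t, Q_t: (target_R, target_S),
    structure_step=lambda R, S, P, Q: (P, Q))
```

In the toy run the first pass changes the motion and the second pass detects no further change, so the loop stops after two iterations.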

At step 3, for a fixed (R, S), $\tilde{p}_i^*(R,S)$ and $\tilde{q}_i^*(R,S)$ can
be computed by minimizing the distance
$\|\tilde{p}_i - p_i\|^2 + \|\tilde{q}_i - q_i\|^2$ for each pair of image
points. Let $t_i \in \mathbb{R}^3$ be the normal vector (of unit length) to the
(epipolar) plane spanned by $(\tilde{q}_i, S)$. Given such a $t_i$,
$\tilde{p}_i$ and $\tilde{q}_i$ are determined by:



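$$\tilde{p}_i(t'_i) = \frac{\hat{e}_3 t'_i {t'_i}^T \hat{e}_3^T p_i + \hat{t'_i}^T \hat{t'_i} e_3}{e_3^T \hat{t'_i}^T \hat{t'_i} e_3}, \qquad \tilde{q}_i(t_i) = \frac{\hat{e}_3 t_i t_i^T \hat{e}_3^T q_i + \hat{t}_i^T \hat{t}_i e_3}{e_3^T \hat{t}_i^T \hat{t}_i e_3}$$

where $t'_i = R t_i$. Then the distance can be explicitly expressed as: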



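$$\|q_i - \tilde{q}_i\|^2 + \|p_i - \tilde{p}_i\|^2 = \|q_i\|^2 + \frac{t_i^T A_i t_i}{t_i^T B_i t_i} + \|p_i\|^2 + \frac{{t'_i}^T C_i t'_i}{{t'_i}^T D_i t'_i} \qquad (10)$$

where

$$A_i = I - (\hat{e}_3 q_i q_i^T \hat{e}_3^T + \hat{q}_i\hat{e}_3 + \hat{e}_3\hat{q}_i), \quad B_i = \hat{e}_3^T\hat{e}_3, \quad C_i = I - (\hat{e}_3 p_i p_i^T \hat{e}_3^T + \hat{p}_i\hat{e}_3 + \hat{e}_3\hat{p}_i), \quad D_i = \hat{e}_3^T\hat{e}_3. \qquad (11)$$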



Then the problem of finding $\tilde{p}_i^*(R,S)$ and $\tilde{q}_i^*(R,S)$ becomes
one of finding $t_i^*$ which minimizes a sum of two singular Rayleigh quotients:

$$\min_{t_i^T S = 0,\; t_i^T t_i = 1} V(t_i) = \frac{t_i^T A_i t_i}{t_i^T B_i t_i} + \frac{t_i^T R^T C_i R t_i}{t_i^T R^T D_i R t_i}. \qquad (12)$$

This is an optimization problem on a unit circle $S^1$ in the plane orthogonal to
the vector S. If $n_1, n_2 \in \mathbb{R}^3$ are vectors such that $S, n_1, n_2$
form an orthonormal basis of $\mathbb{R}^3$, then
$t_i = \cos(\theta) n_1 + \sin(\theta) n_2$ with $\theta \in \mathbb{R}$. We only
need to find $\theta^*$ which minimizes the function $V(t_i(\theta))$. From the
geometric interpretation of the optimal solution, we also know that the global
minimum $\theta^*$ should lie between two values $\theta_1$ and $\theta_2$ such
that $t_i(\theta_1)$ and $t_i(\theta_2)$ correspond to the normal vectors of the
two planes spanned by $(q_i, S)$ and $(R^T p_i, S)$ respectively (if $p_i, q_i$
are already triangulated, these two planes coincide). Therefore, in our approach
local minima are no longer an issue for triangulation, as opposed to the method
proposed in [?]. The problem now becomes a simple bounded minimization problem
for a scalar function and can be efficiently solved using standard optimization
routines (such as "fmin" in Matlab or Newton's algorithm). If one properly
parameterizes $t_i(\theta)$, $t_i^*$ can also be obtained by solving a 6-degree
polynomial equation, as shown in [?] (and an approximate version results in
solving a 4-degree polynomial equation [?]). However, the method given in [?]
involves a coordinate transformation for each image pair and the given
parameterization is by no means canonical. For example, if one chooses instead
the commonly used parameterization of a circle $S^1$:

$$\sin(2\theta) = \frac{2\lambda}{1+\lambda^2}, \quad \cos(2\theta) = \frac{1-\lambda^2}{1+\lambda^2}, \quad \lambda \in \mathbb{R},$$

then it is straightforward to show from the Rayleigh quotient sum (12) that the
necessary condition for minima of $V(t_i)$ is equivalent to a 6-degree
polynomial equation in $\lambda$. The triangulated pairs
$(\tilde{p}_i, \tilde{q}_i)$ and the camera motion (R, S) obtained from the
minimization automatically give a consistent (optimal) 3D structure
reconstruction by two-frame stereo.
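The scalar minimization over $\theta$ described above can be sketched as follows. This is an illustrative sketch rather than the authors' code. It evaluates V directly as the sum of the squared distances from $q_i$ and $p_i$ to the image lines with normals $t_i$ and $R t_i$, instead of forming the matrices $A_i, \ldots, D_i$, and it replaces the bracket $[\theta_1, \theta_2]$ by a coarse scan of the half circle followed by golden-section refinement. All function names are hypothetical.

```python
import numpy as np

def line_dist2(l, x):
    """Squared distance in the plane z = 1 from x = (x, y, 1) to the line l^T x = 0."""
    return (l @ x) ** 2 / max(l[0] ** 2 + l[1] ** 2, 1e-300)

def foot_on_line(l, x):
    """Closest point to x on the line l^T x = 0, returned as (x, y, 1)."""
    n = l[:2]
    xy = x[:2] - n * (l @ x) / (n @ n)
    return np.array([xy[0], xy[1], 1.0])

def triangulate_pair(R, S, p, q, n_grid=720, n_refine=60):
    """Return (p_tilde, q_tilde, V_min) for one correspondence and a fixed (R, S)."""
    helper = np.eye(3)[np.argmin(np.abs(S))]
    n1 = np.cross(S, helper)
    n1 /= np.linalg.norm(n1)
    n2 = np.cross(S, n1)                      # (S, n1, n2) is an orthonormal basis

    def t_of(theta):
        return np.cos(theta) * n1 + np.sin(theta) * n2

    def V(theta):
        t = t_of(theta)
        return line_dist2(t, q) + line_dist2(R @ t, p)

    # Coarse scan over half the circle: t and -t describe the same plane.
    thetas = np.linspace(0.0, np.pi, n_grid, endpoint=False)
    k = int(np.argmin([V(th) for th in thetas]))
    lo, hi = thetas[k] - np.pi / n_grid, thetas[k] + np.pi / n_grid
    g = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(n_refine):                 # golden-section refinement
        a, b = hi - g * (hi - lo), lo + g * (hi - lo)
        if V(a) < V(b):
            hi = b
        else:
            lo = a
    theta = 0.5 * (lo + hi)
    t = t_of(theta)
    return foot_on_line(R @ t, p), foot_on_line(t, q), V(theta)

# Hypothetical example: one exact and one perturbed correspondence.
ang = np.deg2rad(10.0)
R = np.array([[np.cos(ang), 0.0, np.sin(ang)],
              [0.0, 1.0, 0.0],
              [-np.sin(ang), 0.0, np.cos(ang)]])
S = np.array([1.0, 0.0, 0.0])
X2 = np.array([30.0, -20.0, 200.0])
X1 = R @ (X2 - 20.0 * S)
p, q = X1 / X1[2], X2 / X2[2]
p_t, q_t, v = triangulate_pair(R, S, p, q)
p_n, q_n = p + np.array([0.01, -0.02, 0.0]), q + np.array([-0.015, 0.01, 0.0])
p_tn, q_tn, v_n = triangulate_pair(R, S, p_n, q_n)
```

By construction the returned pair satisfies the epipolar constraint for the given (R, S), and `v_n` equals the squared displacement of the two image points.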

In the expressions of $F_t$ given by (8) or (9), if we simply approximate
$\tilde{p}_i, \tilde{q}_i$ by $p_i, q_i$ respectively, we obtain the normalized
versions of the epipolar constraints for recovering camera motion. Although
subtle differences between $F_s$, $F_g$ and $F_t$ have previously been pointed
out in [?], our approach reveals that all three objective functions can be
unified in the same optimization procedure: they are just slightly different
approximations of the same objective function $F_t^*$. Practically speaking,
using either normalized objective function $F_s$ or $F_g$, one can already get
camera motion estimates which are very close to the optimal ones. This will be
demonstrated by extensive simulations in the next section.

Footnote: Since there is no closed form solution to 6-degree polynomial
equations, directly minimizing the Rayleigh quotient sum (12) avoids unnecessary
transformations and hence can be much more efficient.



4 Critical Values and Ambiguous Solutions

We devote the remainder of this paper to a study of the robustness and
sensitivity of the motion and structure estimation problem in the presence of
large levels of noise. We emphasize here the role of the linear techniques for
initialization and utilize the characterization of the space of essential
matrices and the intrinsic optimization techniques on the essential manifold.
The focus of our robustness study is the appearance of new local minima. As in
any nonlinear system, when the noise level increases, new critical points of the
objective function can be introduced through bifurcation. Although in general an
objective function could have numerous critical points, the numbers of different
types of critical points have to satisfy the so-called Morse inequalities, which
are associated with topological invariants of the underlying manifold (see [?]).
Key to this study is the computation of the Euler characteristic $\chi(M)$ of
the underlying manifold $SO(3) \times \mathbb{RP}^2$, which is in this case 0:
$\chi(SO(3) \times \mathbb{RP}^2) = 0$. The Euler characteristic is equal to
$\sum_{\lambda=0}^{n} (-1)^{\lambda} D_\lambda$, where $D_\lambda$ is the
dimension of the homology group $H_\lambda(M, \mathbb{K})$ of M over any field
$\mathbb{K}$, the so-called Betti number. In our case
$D_\lambda = 1, 2, 3, 3, 2, 1$ for $\lambda = 0, 1, 2, 3, 4, 5$ types of critical
points respectively. For details of this computation see [?]. Among all the
critical points, those belonging to type 0 are called (local) minima, type n are
(local) maxima, and types 1 to n − 1 are saddles. From the above computation,
any Morse function defined on $SO(3) \times \mathbb{RP}^2$ must have all three
kinds of critical values. The nonlinear search algorithms proposed above try to
find the global minimum of the given objective functions. We study the effect of
initialization by linear techniques and the appearance of new critical points on
different slices of the nonlinear objective function, which can be easily
visualized. The choice of the slice is determined by the rotation estimate to
which the nonlinear algorithm converged when initialized by the linear algorithm.
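The Betti numbers and the Euler characteristic quoted above can be checked with a few lines of arithmetic. The sketch below rests on assumptions that are not spelled out in the text: homology is taken with coefficients in Z/2 (where every Betti number of a real projective space is 1), SO(3) is identified with $\mathbb{RP}^3$, and the Betti numbers of a product are obtained from the Künneth formula by multiplying Poincaré polynomials.

```python
import numpy as np

# Mod-2 Betti numbers (assumption: Z/2 coefficients).
betti_so3 = [1, 1, 1, 1]      # SO(3) is homeomorphic to RP^3
betti_rp2 = [1, 1, 1]

# Kunneth formula over a field: the Poincare polynomials multiply,
# which is a convolution of the coefficient lists.
betti_product = np.convolve(betti_so3, betti_rp2)

def euler_characteristic(betti):
    """Alternating sum of Betti numbers."""
    return int(sum((-1) ** k * b for k, b in enumerate(betti)))

print(list(betti_product))                     # [1, 2, 3, 3, 2, 1]
print(euler_characteristic(betti_product))     # 0
print(euler_characteristic(betti_rp2))         # 1
```

The alternating sum is 1 − 2 + 3 − 3 + 2 − 1 = 0, in agreement with the value stated in the text.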

Rewriting the epipolar constraint as $p_i^T E q_i = 0$, $i = 1, \ldots, N$,
minimizing the objective function F is (approximately) equivalent to the
following least squares problem: $\min \|Ae\|^2$, where A is an N × 9 matrix
function of the entries of $p_i$ and $q_i$, and $e \in \mathbb{R}^9$ is a vector
of the nine entries of E. Then e lies in the (usually one-dimensional) null
space of the 9 × 9 symmetric matrix $A^T A$. In the presence of noise, e is
simply chosen to be the eigenvector corresponding to the least eigenvalue of
$A^T A$. At a low noise level, this eigenvector in general gives a good initial
estimate of the essential matrix. However, at a certain high noise level, the
smallest two eigenvalues may switch roles, as do the two corresponding
eigenvectors; topologically, a bifurcation as shown in Figure 2 occurs. This
phenomenon is very common in the motion estimation problem: at a high noise
level, the translation estimate may suddenly change direction by roughly 90°,
especially in the case when the translation is parallel to the image plane. We
will refer to such estimates as the second eigenmotion. A similar situation for
the differential case and small field of view has previously been reported in
[2].
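The linear step and the eigenvalue indicator can be sketched as below. This is a minimal illustration under assumptions: the vector e stacks the entries of E row by row, so each row of A is the Kronecker product of $p_i$ and $q_i$, and the projection of the result onto the essential manifold is omitted. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def linear_essential(P, Q):
    """Two smallest eigenvectors of A^T A as 3x3 matrices, plus the eigenvalue ratio."""
    A = np.einsum("ij,ik->ijk", P, Q).reshape(len(P), 9)   # row i = kron(p_i, q_i)
    w, V = np.linalg.eigh(A.T @ A)                          # eigenvalues in ascending order
    E_first = V[:, 0].reshape(3, 3)      # usual linear estimate of E (up to scale)
    E_second = V[:, 1].reshape(3, 3)     # candidate for the second eigenmotion
    return E_first, E_second, w[0] / w[1]

# Hypothetical noise-free scene: rotation about Y, translation along X.
rng = np.random.default_rng(0)
ang = np.deg2rad(10.0)
R = np.array([[np.cos(ang), 0.0, np.sin(ang)],
              [0.0, 1.0, 0.0],
              [-np.sin(ang), 0.0, np.cos(ang)]])
S = np.array([1.0, 0.0, 0.0])
S_hat = np.array([[0.0, -S[2], S[1]], [S[2], 0.0, -S[0]], [-S[1], S[0], 0.0]])
X2 = np.column_stack([rng.uniform(-100, 100, 40),
                      rng.uniform(-100, 100, 40),
                      rng.uniform(100, 400, 40)])
X1 = (X2 - 20.0 * S) @ R.T
P, Q = X1 / X1[:, 2:3], X2 / X2[:, 2:3]

E_first, E_second, ratio = linear_essential(P, Q)
E_true = R @ S_hat
alignment = abs(np.sum(E_first * E_true)) / np.linalg.norm(E_true)
```

Without noise the ratio is essentially zero and `E_first` is aligned with $R\hat{S}$ up to sign and scale. A ratio approaching 1 would indicate that the two smallest eigenvalues are about to exchange roles.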

Figures 1 and 2 demonstrate such a sudden appearance of the second eigenmotion.
They are the simulation results of the proposed nonlinear algorithm minimizing
the function $F_s$ for a cloud of 40 randomly generated pairs of image
correspondences (in a field of view of 90°, with depth varying from 100 to 400
units of focal length). Gaussian noise with a standard deviation of 6.4 or 6.5
pixels is added to each image point (image size 512 × 512 pixels). To make the
results comparable, we used the same random seeds for both runs. The actual
rotation is 10° about the Y-axis and the actual translation is along the X-axis.
The ratio between translation and rotation is 2. In the figures, "+" marks the
actual translation and "o" marks the estimate from nonlinear optimization; the
translation estimate from the linear algorithm (see [?] for detail) is also
marked. Up to the noise level of 6.4 pixels, both rotation and translation
estimates are very close to the actual motion. Increasing the noise level
further by 0.1 pixel, the translation estimate suddenly switches to one which is
roughly 90° away from the actual translation. Geometrically, this estimate
corresponds to the eigenvector associated with the second smallest eigenvalue of
the matrix $A^T A$, as we discussed before. Topologically, this estimate
corresponds to the local minimum introduced by a bifurcation, as shown in
Figure 2. Clearly, in Figure 1 there is 1 maximum, 1 saddle and 1 minimum on
$\mathbb{RP}^2$; in Figure 2 there is 1 maximum, 2 saddles and 2 minima. Both
patterns give the Euler characteristic of $\mathbb{RP}^2$ as 1. Rotation is
fixed at the estimate from the nonlinear algorithm. The errors are expressed in
terms of the canonical metric on SO(3) for rotation and in terms of angle for
translation.
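The two error measures can be sketched as follows. Both definitions are assumptions: the canonical metric on SO(3) is taken to be the geodesic rotation angle in radians (other normalizations differ by a constant factor), and the translation error is taken as the angle between directions regarded as points of $\mathbb{RP}^2$, so that S and −S count as the same direction.

```python
import numpy as np

def rotation_error(R_est, R_true):
    """Geodesic distance on SO(3): the angle (radians) of the relative rotation."""
    c = (np.trace(R_est.T @ R_true) - 1.0) / 2.0
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error_deg(S_est, S_true):
    """Angle in degrees between two translation directions, ignoring sign."""
    c = abs(S_est @ S_true) / (np.linalg.norm(S_est) * np.linalg.norm(S_true))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))

# Example: a 10 degree rotation about Y compared with the identity.
ang = np.deg2rad(10.0)
R10 = np.array([[np.cos(ang), 0.0, np.sin(ang)],
                [0.0, 1.0, 0.0],
                [-np.sin(ang), 0.0, np.cos(ang)]])
rot_err = rotation_error(R10, np.eye(3))
tr_err = translation_error_deg(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

Under these conventions a translation estimate that has switched to the second eigenmotion shows an error close to 90°, the largest value this measure can take.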




Fig. 1. Value of objective function $F_s$ for all S at noise level 6.4 pixels.
Estimation errors: 0.014 in rotation estimate and 2.39° in translation estimate.

Fig. 2. Value of objective function $F_s$ for all S at noise level 6.5 pixels.
Estimation errors: 0.227 in rotation estimate and 84.66° in translation estimate.

Fig. 3. Bas relief ambiguity. FOV is 20°, point depths vary from 100 to 150
units of focal length, rotation magnitude is 2°, T/R ratio is 2. 20 runs with
noise level 1.3 pixels.



Footnote: We use the convention that the Y-axis is the vertical direction of the
image, the X-axis is the horizontal direction, and the Z-axis coincides with the
optical axis of the camera.

Footnote: Rotation and translation magnitudes are compared with respect to the
center of the cloud of 3D points generated.

From Figure 2 we can see that the second eigenmotion ambiguity is even more
likely to occur (at a certain high noise level) than the other local minimum
marked by "o" in the figure, which is a legitimate estimate of the actual one.
These two estimates always occur in pairs and exist for general configurations
even when both the FOV and depth variation are sufficiently large. We propose a
way of resolving the second eigenmotion ambiguity using the linear algorithm
which is used for initialization. An indicator of the configuration being close
to critical is the ratio of the two smallest eigenvalues of $A^T A$, $\sigma_9$
and $\sigma_8$. By using both eigenvectors $v_9$ and $v_8$ for computing the
linear motion estimates, the one which satisfies the positive depth constraint
by the larger margin (i.e. for which a larger number of points satisfies the
positive depth constraint) leads to the motion estimate closer to the true one
(see [?] for details).
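Counting the points that satisfy the positive depth constraint can be sketched as follows. The sketch assumes the rigid-motion convention $\lambda_1 p = R(\lambda_2 q - S)$, which is consistent with the epipolar constraint $p^T R\hat{S} q = 0$ but is not stated explicitly in the text. The function name is hypothetical, and the extraction of (R, S) from each eigenvector is not shown.

```python
import numpy as np

def count_positive_depth(R, S, P, Q):
    """Number of correspondences whose two depths are both positive for (R, S)."""
    count = 0
    for p, q in zip(P, Q):
        # Solve lam1 * R^T p - lam2 * q = -S in the least-squares sense.
        M = np.column_stack([R.T @ p, -q])
        lam, *_ = np.linalg.lstsq(M, -S, rcond=None)
        count += int(lam[0] > 0 and lam[1] > 0)
    return count

# Hypothetical noise-free scene: rotation about Y, translation along X.
rng = np.random.default_rng(0)
ang = np.deg2rad(10.0)
R = np.array([[np.cos(ang), 0.0, np.sin(ang)],
              [0.0, 1.0, 0.0],
              [-np.sin(ang), 0.0, np.cos(ang)]])
S = np.array([1.0, 0.0, 0.0])
X2 = np.column_stack([rng.uniform(-100, 100, 40),
                      rng.uniform(-100, 100, 40),
                      rng.uniform(100, 400, 40)])
X1 = (X2 - 20.0 * S) @ R.T
P, Q = X1 / X1[:, 2:3], X2 / X2[:, 2:3]

n_true = count_positive_depth(R, S, P, Q)       # all 40 points pass
n_flipped = count_positive_depth(R, -S, P, Q)   # none pass with the sign reversed
```

Given the motion candidates obtained from the two eigenvectors, the candidate with the larger count would be retained.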

This second eigenmotion effect has a quite different interpretation from the one
previously attributed to the bas relief ambiguity. The bas relief effect is only
evident when the FOV and depth variation are small, but the second eigenmotion
ambiguity may show up for general configurations. Bas relief estimates are
statistically meaningful since they characterize a sensitive direction in which
translation and rotation are the most likely to be confounded. The second
eigenmotion, however, is not statistically meaningful: it is an effect of
initialization which, with increasing noise level, causes a perturbation to a
different slice of the objective function with a different topology of the
residual. This effect occurs only at a high noise level, and this critical noise
level gives a measure of the robustness of the linear initialization of the
given algorithm. For comparison, Figure 3 demonstrates the effect of the bas
relief ambiguity: the long narrow valley of the objective function corresponds
to the direction that is the most sensitive to noise. Translation is along the
X-axis and rotation around the Y-axis. The (translation) estimates of 20 runs,
marked as "o", give a distribution roughly resembling the shape of this valley;
the actual translation is marked as "+" in the center of the valley, which is
covered by circles.

Footnote: The direction most sensitive to noise is given by the eigenvector of
the Hessian associated with the smallest eigenvalue.



5 Experiments and Sensitivity Analysis

In this section, we demonstrate by experiments the relationship among the linear
algorithm (as in [14]), the nonlinear algorithm (minimizing F), the normalized
nonlinear algorithm (minimizing $F_s$) and optimal triangulation (minimizing
$F_t$). Due to the nature of the second eigenmotion ambiguity (when not
corrected), it gives statistically meaningless estimates. Such estimates should
be treated as "outliers" if one wants to properly evaluate a given algorithm and
compare simulation results. We will demonstrate that seemingly conflicting
statements in the literature about the performance of existing algorithms can in
fact be given a unified explanation if we systematically compare the simulation
results over a large range of noise levels (as long as the results are
statistically meaningful).

The following simulations were carried out with the points in general
configuration and the camera parameters described in Section 4. All nonlinear
algorithms are initialized by the estimates from the standard 8-point linear
algorithm (see [?]), instead of from the ground truth. The criteria for all
nonlinear algorithms to stop are: (a) the norm of the gradient is less than a
given error tolerance, which we usually pick as 10 ® unless otherwise stated;
and (b) the smallest eigenvalue of the Hessian matrix is positive.
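The two stopping criteria can be combined in a single test. The following is a minimal sketch, assuming that the gradient and the Hessian are available as a vector and a symmetric matrix in local coordinates on the manifold. The tolerance is left as a parameter, and the value used in the example is only a placeholder.

```python
import numpy as np

def should_stop(grad, hess, tol):
    """True when the gradient is small and the Hessian is positive definite."""
    gradient_small = np.linalg.norm(grad) < tol          # criterion (a)
    smallest_eig = np.linalg.eigvalsh(hess)[0]           # eigenvalues sorted ascending
    return bool(gradient_small and smallest_eig > 0.0)   # criterion (b)

tol = 1e-8   # placeholder tolerance for the example only
at_minimum = should_stop(np.zeros(2), np.diag([2.0, 2.0]), tol)    # f = x^2 + y^2 at 0
at_saddle = should_stop(np.zeros(2), np.diag([2.0, -2.0]), tol)    # f = x^2 - y^2 at 0
```

The second criterion is what distinguishes the two cases: the gradient vanishes at both points, but only the first is accepted.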



Axis Dependency Profile. It is well known that the sensitivity of motion
estimation depends on the camera motion. However, in order to give a clear
account of such a dependency, one has to be careful about two things: 1. the
signal-to-noise ratio and 2. whether the simulation results are still
statistically meaningful while varying the noise level. Figures 4, 5, 6 and 7
give simulation results of 100 trials for each combination of translation and
rotation ("T-R") axes; for example, "X-Y" means translation is along the X-axis
and the rotation axis is the Y-axis. Rotation is always 10° about the axis and
the T/R ratio is 2. In the figures, "linear" stands for the standard 8-point
linear algorithm; "nonlin" is the Riemannian Newton's algorithm minimizing the
epipolar constraints F; "normal" is the Riemannian Newton's algorithm minimizing
the normalized epipolar constraints $F_s$.




Fig. 4. Axis dependency: estimation 
errors in rotation and translation at 
noise level 1.0 pixel. T/R ratio = 2 and 
rotation = 10°. 




Fig. 5. Axis dependency: estimation 
errors in rotation and translation at 
noise level 3.0 pixels. T/R ratio = 2 
and rotation = 10°. 



Fig. 6. Axis dependency: estimation errors in rotation and translation at noise
level 5.0 pixels. T/R ratio = 2 and rotation = 10°.

Fig. 7. Axis dependency: estimation errors in rotation and translation at noise
level 7.0 pixels. T/R ratio = 2 and rotation = 10°.

Footnote: Our current implementation of the algorithms in Matlab has a numerical
accuracy at 10“®.

Footnote: Since we have explicit formulae for the Hessian, stopping criterion (b)
keeps the algorithms from stopping at saddle points.

By carefully comparing the simulation results in Figures 4, 5, 6 and 7 we can
draw the following conclusions:

1. Optimization Techniques (linear vs. nonlinear)

(a) Minimizing F in general gives better estimates than the linear algorithm at
low noise levels (Figures 4 and 5). At higher noise levels, this is no longer
true (Figures 6 and 7), due to the more global nature of the linear technique.

(b) Minimizing the normalized $F_s$ in general gives better estimates than the
linear algorithm at moderate noise levels (all figures).



2. Optimization Criteria (F vs. $F_s$)

(a) At relatively low noise levels (Figure 4), normalization has little effect
when translation is parallel to the image plane, and estimates are indeed
improved when translation is along the Z-axis.

(b) However, at moderate noise levels (Figures 5, 6 and 7), when translation is
along the Z-axis, little improvement can be gained by minimizing $F_s$ instead
of F; when translation is parallel to the image plane, however, F is more
sensitive to noise and minimizing the statistically less biased $F_s$
consistently improves the estimates.

3. Axis Dependency

(a) All three algorithms are the most robust to increasing noise when the
translation is along Z. At moderate noise levels (all figures), their
performances are quite close to each other.

(b) Although, at relatively low noise levels (Figures 4, 5 and 6), estimation
errors seem to be larger when the translation is along the Z-axis, estimates are
in fact much less sensitive to noise and more robust to increasing noise in this
case. The larger estimation error in the case of translation along the Z-axis
arises because the displacements of image points are smaller than those when
translation is parallel to the image plane, so the signal-to-noise ratio is in
fact smaller.

(c) At a noise level of 7 pixels (Figure 7), estimation errors seem to become
smaller when the translation is along the Z-axis. This is due to the fact that,
at a noise level of 7 pixels, the second eigenmotion ambiguity already occurs in
some of the trials when the translation is parallel to the image plane.

The second statement about the axis dependency supplements the observation given
in [?]. In fact, the motion estimates are both robust and less sensitive to
increasing noise when translation is along the Z-axis. For a fixed base line,
high noise level results resemble those for a smaller base line at a moderate
noise level. Figure 7 is therefore a generic picture of the axis dependency
profile for the differential or small base-line case (for more details see
[12]).

Non-iterative vs. Iterative. In general, the motion estimates obtained from
directly minimizing the normalized epipolar constraints $F_s$ or $F_g$ are
already very close to the solution of the optimal triangulation obtained by
minimizing $F_t$ iteratively between motion and structure. It is already known
that, at low noise levels, the estimates from the non-iterative and iterative
schemes usually differ by less than a couple of percent [?].

Fig. 8. Estimation errors of rotation (in canonical metric on SO(3)). 50 trials,
rotation 10 degrees around the Y-axis and translation along the X-axis, T/R
ratio is 2. Noise ranges from 0.5 to 5 pixels.

Fig. 9. Estimation errors of translation (in degrees). 50 trials, rotation 10
degrees around the Y-axis and translation along the X-axis, T/R ratio is 2.
Noise ranges from 0.5 to 5 pixels.

Fig. 10. Estimation errors of rotation (in canonical metric on SO(3)). 40
points, 50 trials, rotation 10 degrees around the Y-axis and translation along
the Z-axis, T/R ratio is 2. Noise ranges from 2.5 to 20 pixels.

Fig. 11. Estimation errors of translation (in degrees). 40 points, 50 trials,
rotation 10 degrees around the Y-axis and translation along the Z-axis, T/R
ratio is 2. Noise ranges from 2.5 to 20 pixels.

By comparing the simulation results in Figures 8, 9, 10 and 11 we can draw the
following conclusions:

1. Although the iterative optimal triangulation algorithm usually gives better
estimates (as it should), the non-iterative minimization of the normalized
epipolar constraints $F_s$ or $F_g$ gives motion estimates with only a few
percent larger errors over the whole range of noise levels. The higher the noise
level, the more evident the improvement of the iterative scheme is.

2. Within moderate noise levels, normalized nonlinear algorithms consistently
give significantly better estimates than the standard linear algorithm,
especially when the translation is parallel to the image plane. At very high
noise levels, the standard linear algorithm outperforms the nonlinear
algorithms. This is due to the more global nature of the linear algorithm.
However, such high noise levels are barely realistic in real applications.

For low-level Gaussian noise, the iterative optimal triangulation algorithm
gives the MAP estimates of the camera motion and scene structure, and the
estimation error can be shown to be close to the theoretical error bounds, such
as the Cramer-Rao bound. This has been shown experimentally in [?].
Consequently, minimizing the normalized epipolar constraints $F_s$ or $F_g$
gives motion estimates close to the error bound as well.

6 Discussions and Future Work

Although previously proposed algorithms already have good performance in 
practice, the geometric concepts behind them have not yet been completely 
revealed. The non-degeneracy conditions and convergence speed of those al- 
gorithms are usually not explicitly addressed. Due to the recent development 
of optimization methods on Riemannian manifolds, we now can have a better 
mathematical understanding of these algorithms, and propose new geometric al- 
gorithms or filters, which exploit the intrinsic geometric structure of the motion 
and structure recovery problem. As shown in this paper, regardless of the choice 
of different objectives, the problem of optimization on the essential manifold is 
common and essential to the optimal motion and structure recovery problem. 
Furthermore, from a pure optimization theoretic viewpoint, most of the objective 
functions previously used in the literature can be unified in a single optimiza- 
tion procedure. Consequently, “minimizing (normalized) epipolar constraints”, 
“triangulation”, “minimizing reprojection errors” are all different (approximate) 
versions of the same simple optimal triangulation algorithm. 

In this paper, we have studied in detail the problem of recovering a discrete
motion (displacement) from image correspondences. Similar ideas certainly apply
to the differential case, where the rotation and translation are replaced by
angular and linear velocities respectively [?]. One can show that they all in
fact minimize certain normalized versions of the differential epipolar
constraint. We hope the Riemannian optimization theoretic viewpoint proposed in
this paper will provide a different perspective from which to revisit these
schemes. Although the study of the proposed algorithms is carried out in a
calibrated camera framework, due to a clear geometric connection between the
calibrated and uncalibrated case [?], the same approach and optimization schemes
can be generalized with little effort to the uncalibrated case as well. Details
will be presented in future work.



Discussion

Kenichi Kanatani: You compare your method with other techniques, but in 
my view what you should really do is compare it with the theoretical accuracy 
bound, the lower bound beyond which accuracy can’t be improved. For the 
problems you have described so far it is very easy to derive this bound. 

Jana Kosecka: Theoretical accuracy is usually expressed in terms of the 
Cramer-Rao bound, but there’s an alternative way to look at it. If one bases 
the optimization on the epipolar constraint, it turns out that no matter what 
you do, half of the variance always gets absorbed by the structure. You can 
not do better than that — the error along the epipolar line gets absorbed by 
the structure, so you can only improve the error perpendicular to the epipolar 
line. One can even consider this as an alternative means of putting some lower 
bound on the estimates using these kind of techniques. Also, Weng, Huang and 
Ahuja [21] already did the comparison with the theoretical bound. Rather than
repeating this analysis, we preferred to give a complementary viewpoint.



Abstract. We attack the problem of coordinate frame dependence and
gauge freedoms in structure-from-motion. We are able to formulate a
bundle adjustment algorithm whose results are independent of both the
coordinate frame chosen to represent the scene and the ordering of the
images. This method is more efficient than existing approaches to the
problem in photogrammetry.

We demonstrate that to achieve coordinate frame independent results, 
(i) Rotations should be represented by quaternions or local rotation pa- 
rameters, not angles, and (ii) the translation vector describing the cam- 
era/scene motion should be represented in scene 3D coordinates, not 
camera 3D coordinates, two representations which are normally treated 
as interchangeable. The algorithm allows 3D point and line features to be 
reconstructed. Implementation is via the efficient recursive partitioning 
algorithm common in photogrammetry. Results are presented demon- 
strating the advantages of the new method in terms of the stability of 
the bundle adjustment iterations. 



1 Introduction 

Parameter estimation is usually posed as the minimization of the geometric 
error in the image measurements, and solved by a suitable non-linear optimiza- 
tion algorithm. Convergence and stability are two recurrent important issues in 
implementing such algorithms. There are usually several factors at work here, 
from the obvious ones, such as design of the algorithm, and its ability to sup- 
press noise, to less obvious ones such as correlations in the measurements. Given 
many candidates for culprits when experiencing problems, it is often difficult to 
determine which factor(s) are at fault. We shall attack one of these, with the aim 
of eliminating it. This effect is the coordinate frame ambiguity, which arises 
from the fact that simply selecting different coordinate frames for the space of 
parameters may affect the results of algorithms using that representation. 

Coordinate frame ambiguities (or gauge freedoms) arise in problems where the
natural representations over-parametrize the problem. The “extra” parameters
are those that specify a coordinate frame. Problems of interpreting geometric
aspects of a scene (e.g. its 3D shape) by combining multiple observations of it,
using a sensor lacking an absolute frame of reference, always have this property.
This is because to compute geometrical quantities, one must define a frame of
reference, and because no absolute frame of reference is provided by the sensor,
there is no natural frame and the choice of coordinate frame is arbitrary. One can
obtain different, equivalent answers using different frames of reference. However,
designing algorithms with the necessary independence property to make the choice
truly arbitrary, in the sense that the algorithm behaves in equivalent ways given
different choices, is not trivial.

Achieving coordinate frame independence may seem like a small advance,
especially when good results have been achieved in 3D reconstruction without
much care given to this issue. However, we argue that if reconstruction
algorithms are to be integrated into larger vision systems, it is vital that
they have predictable performance. Elimination of the effect of arbitrary
choices inside algorithms is an important step in this direction. Moreover, we
have found in the case of projective 3D reconstruction that our methods deliver
improvements in convergence compared with alternative methods [?].

One of the surprising results of our work is that even where strong non- 
linearity is present in the projection equations, such as in the projective and 
perspective models, the effective choice of coordinate frame can be reduced to 
an orthogonal transformation without affecting our desideratum of independence 
to image ordering. This is achieved by a combination of appropriate choice of 
scene motion and structure representation, suitable rank-deficient linear system 
solution methods, and a normalisation step which eliminates the non-orthogonal 
part of the coordinate frame ambiguity. In [?] we presented the algorithm for
projective reconstruction. Here we discuss the principles of gauge independence
in general while restricting detailed discussion to the case of Euclidean
reconstruction.

1.1 Examples of Gauge Freedoms in Vision 

There are many examples of optimisation algorithms in vision where the natural 
representation of the parameters to be estimated contains gauge freedoms:- 

— The fundamental matrix is a 3 x 3 matrix defining the epipolar geometry of 
two uncalibrated views of a scene. It has two redundant degrees of freedom, 
one being a scale gauge freedom, the other being the constraint that the 
matrix is singular. 

— The problem of registering multiple range images is typically attacked by 
selecting a single “reference” range map, and registering the others to the 
coordinate frame of the reference map [?]. Clearly the results will depend on
the choice of reference range map. 

— Any optimisation involving quantities whose natural representation is in ho- 
mogeneous coordinates, and so has a scale freedom. This turns up in pro- 
jective reconstruction, such as estimating the fundamental matrix as above, 
and also applies when computing structure and motion. There is always a 
choice whether to fix one of the elements of the vector/matrix to unity or to 
keep the full representation and use constrained optimisation. 
— Structure-from-motion and photogrammetry share the coordinate frame
problem (known as the ZOD or zero-order design problem by
photogrammetrists [?]). This is the main subject of the current work.



1.2 Three-Stranded Approach 

To achieve gauge independent algorithms, our design has three separate but 
interrelated strands, which we introduce here. Because of the ubiquity of the 
phrase “gauge independence” in the paper, we shall abbreviate it to “GI”. The
meaning of gauge independence in this context is that if two reconstructions 
are related by a coordinate frame transformation, which can be thought of as 
two separate “runs” of the algorithm, then the optimisation algorithm should 
maintain the same coordinate frame transformation throughout, so that at the 
end of the algorithm the two reconstructions are still equivalent. 



Representation. There are often choices of representation available which are 
difficult to distinguish on first appearance. Some examples relevant to the dis- 
cussion here:- 

1. 3D rotations can be represented in a variety of angle systems, such as Euler 
or Cardan angles, quaternions, or local rotation parameters. 

2. In projective reconstruction, projective points and projection matrices can be 
represented in homogeneous coordinates, or in non-homogeneous coordinates 
formed by fixing a chosen coordinate at unity. 

3. In Euclidean reconstruction, the translation vector can be represented in 
either camera coordinates or scene coordinates. 

One of the major contributions of this work is to show that for many optimi- 
sation problems the choice of representation may at least partly be decided by 
gauge dependence criteria. We shall see that indeed all the three choices of rep- 
resentation above can be decided by GI (gauge independence) criteria, in favour 
of (1) Local rotation parameters or quaternions, (2) homogeneous coordinates 
normalised to unit norm, and (3) translation vector in scene coordinate frame. 
It may be somewhat surprising that the last case can be decided in this way, 
so we demonstrate the problems with the camera coordinate representation of 
translation in section 4.6.

Rank-deficient linear system methods. We will be considering here the
solution of optimisation problems with coordinate frame ambiguities, using
Gauss-Newton iterative methods. This gives rise to a rank-deficient linear
system to be solved at each iteration. We shall show how this can be done,
maintaining GI. Matrix pseudo-inverse has been suggested as a solution to this
problem both in photogrammetry [?] and recently in computer vision [?]. We
propose a more flexible approach involving the introduction of artificial extra
constraints applied to the updated solution. With suitable choice of
constraints, the two methods can be made equivalent.
Coordinate frame normalisation. To facilitate the application of the
rank-deficient linear system methods, it turns out to be necessary in some
cases, and advantageous in others, to adjust the coordinate system among the
space of possible frames and thus impose some pre-specified constraints, which
will typically be the same as the artificial constraints used to solve the
rank-deficient system; in other words the same constraints are employed before
each iteration (normalisation) and also to the updated solution (when
eliminating rank deficiencies). However the linear system methods are only able
to impose the constraints to first order, which in a non-linear system implies
that the normalisation must be applied between iterations to re-impose the
constraints. The normalisation reduces the effective space of possible
coordinate frames, but of course it must itself be applied in a GI manner, as
defined in section 3.3.

We propose that these three techniques constitute a good general framework
for applying GI to optimisation problems involving gauge freedoms. In the
following sections we shall fill out these ideas in more detail. Section 2
introduces the concepts mathematically and defines the notation used in the
remainder of the paper. Section 3 defines the Gauss-Newton method, and derives
the general GI criteria in detail, to be applied in later sections to our
models. The perspective projection model for point features is then discussed
in detail in section 4. We demonstrate that our chosen model obeys the GI
criteria. The complete algorithm for gauge-independent Euclidean reconstruction
is summarised in section 5. Some preliminary results are presented in
section 6.



2 Definitions and Notation 

Let us consider a general iterative algorithm that estimates a set of parameters
$x$ from an observation

$$ z = h(x) + w $$

where $h(.)$ is the observation function and $w$ a noise vector with covariance $R$.
In a realistic scenario with multiple observations in multiple images, one may
consider at this point that all the observations are “stacked” into this single
vector $z$, and likewise all the unknowns (e.g. the structure and motion
parameters for the structure-from-motion problem) are stacked into a single
vector $x$. At each iteration, the algorithm takes as input the previous estimate
$x^-$, and computes a new estimate $x^+$, which is based solely on the old value
$x^-$ and the measurement $z$. We shall assume that an additive rule is being
employed, as is the case for the Gauss-Newton variants used almost exclusively
in optimization algorithms for 3D vision and photogrammetry. Then the update
rule may be written in general as

$$ x^+ = x^- + f(h, x^-, z). \tag{1} $$

The algorithm is thus defined by the observation function $h(.)$, the latest state
estimate $x^-$ and the measurement vector $z$. As more iterations are made, and
with a good wind, the estimates $x^+$ converge towards the true parameter vector,
as closely as possible given the inevitable errors in the measurements. Now let us
consider redefining the space of parameters in a different way, using parameters
$y$, which may be assumed to be related by an invertible mapping $g(.)$ to the
original parameter vector $x$:

$$ y = g(x), \qquad x = g^{-1}(y). $$

In the case of $g(.)$ representing a genuine gauge freedom, the transformation has
no effect on the measurements:

$$ h(y) = h(g(x)) = h(x). \tag{2} $$

In other words different choices of $g(.)$ represent purely internal choices of
coordinate frame in which to represent $x$. The iterative update rule in this
transformed space is then

$$ y^+ = y^- + f(h, y^-, z). \tag{3} $$

We can now define our algorithm as independent of the choice of coordinates
if, when applying the algorithm from different starting points $x_0$ and $y_0$
related by the coordinate transformation $g(.)$, the estimates remain related by
the same $g(.)$ at each corresponding iteration. Thus combining equations (1) and
(3) with this criterion, we need to show for a particular problem that, given
the latest state estimates $x^-$, $y^-$ in the two algorithm instances, and the
coordinate transformation $g(.)$ between them,

$$ g(x^- + f(h, x^-, z)) = y^- + f(h, y^-, z) = g(x^-) + f(h, g(x^-), z). \tag{4} $$

This is the gauge independence (GI) criterion. If we can prove it for a particular
problem, we have eliminated the effect of coordinate frame choice.

Throughout the paper we will take the vector norm $\|x\|$ of a vector $x$ to
indicate the 2-norm $(x^T x)^{1/2}$.

3 The Gauss-Newton Method 

We now specialize further to the Gauss-Newton method, a least-squares method 
commonly used to obtain maximum likelihood or maximum a posteriori estimates
of parameters. An adjusted version of the Gauss-Newton scheme that deals with 
conditioning problems is known as the Levenberg-Marquardt algorithm, but we 
first consider the simpler form. We first define an error function to be minimized, 
based on discrepancies in observations, and extrapolate second-order approxi- 
mations to the error function from the current parameter estimate, which then 
yields the new estimate as the minimum of the second-order hypersurface, which 
may be easily computed. 

Following the usual theory of maximum likelihood parameter estimation [?],
we assume that we have made several noisy observations $z(j)$, $j = 1, \ldots, k$
(which can where appropriate be bundled into a single vector $z$ as above),
related to the parameter vector $x$ through measurement functions $h(j)$:

$$ z(j) = h(j; x) + w(j), \qquad j = 1, \ldots, k. \tag{5} $$

Here $w(j)$ is random independently distributed noise with covariance $R(j)$, and
is assumed to have zero mean, i.e. it is unbiased.

We construct the error function $J$ as the sum of squares of discrepancies
between the actual observations and those predicted by a modelling parameter
estimate $x$:

$$ J(x) = \sum_{j=1}^{k} (z(j) - h(j;x))^T R(j)^{-1} (z(j) - h(j;x)). \tag{6} $$

We form a second-order (quadratic) approximation to $J(.)$ around the latest
estimate $x^-$, and locate the minimum of the quadratic extrapolation of $J(.)$
to obtain a new estimate $x^+$. Omitting details of this standard procedure, we
obtain

$$ A (x^+ - x^-) = \sum_{j=1}^{k} H(j)^T R(j)^{-1} \nu(j) = a, \tag{7} $$

where

— The matrix $A = \sum_{j=1}^{k} H(j)^T R(j)^{-1} H(j)$ may be identified as the
state information (inverse covariance) matrix;

— The innovation vectors $\nu(j)$ are $\nu(j) = z(j) - h(j; x^-)$;

— The Jacobian matrices $H(j)$ are $H(j) = \partial h(j)/\partial x$ evaluated at $x^-$.

Equation (7) represents one Gauss-Newton update iteration.
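As an illustration of the update (7), the following is a minimal NumPy sketch, assuming the Jacobians, covariances and innovations are already available as lists of arrays. The function name `gauss_newton_system` and the linear toy data are hypothetical and are not taken from the paper; the sketch assumes a problem without gauge freedoms, so that $A$ is invertible.

```python
import numpy as np

def gauss_newton_system(H_list, R_list, nu_list):
    """Accumulate A = sum_j H(j)^T R(j)^-1 H(j) and a = sum_j H(j)^T R(j)^-1 nu(j)."""
    n = H_list[0].shape[1]
    A = np.zeros((n, n))
    a = np.zeros(n)
    for H, R, nu in zip(H_list, R_list, nu_list):
        Rinv_H = np.linalg.solve(R, H)   # R^-1 H without forming the inverse
        A += H.T @ Rinv_H
        a += Rinv_H.T @ nu               # equals H^T R^-1 nu because R is symmetric
    return A, a

# Hypothetical toy data: four linear 2D observations of a 3-vector, no noise.
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0, 0.5])
H_list = [rng.standard_normal((2, 3)) for _ in range(4)]
R_list = [0.01 * np.eye(2) for _ in range(4)]
x_minus = np.zeros(3)
nu_list = [H @ x_true - H @ x_minus for H in H_list]   # innovations z(j) - h(j; x^-)

A, a = gauss_newton_system(H_list, R_list, nu_list)
x_plus = x_minus + np.linalg.solve(A, a)               # one update of the form (7)
```

Because the toy observation functions are linear and noise-free, a single step recovers `x_true`; for a non-linear model the Jacobians would be re-evaluated at each new estimate.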

3.1 G-N Iterations with Gauge Freedoms 

Summarising the above discussion, our Gauss-Newton update consists of solving 
the symmetric matrix equation 



$$ A \, \Delta x = a \tag{8} $$

for state change vector $\Delta x = x^+ - x^-$, given matrix $A$ and vector $a$
computed from the measurement vectors $z(j)$, measurement functions $h(j)$ and
Jacobians $H(j) = \partial h(j)/\partial x$, evaluated at the latest solution $x^-$.
$A$ is rank-deficient, because of gauge freedoms, and so extra constraints must be
introduced in order to provide a unique solution to the matrix equation. These
are the choices we shall consider for the form of these constraints:-

1. Gauge Fixing methods enforce gauge conditions

$$ c(x) = 0 $$

to force a chosen gauge on the solution. We shall assume that the gauge
conditions apply exactly to the previous solution, i.e. $c(x^-) = 0$. There are
two ways to impose gauge $c(x^+) = 0$ on the new solution $x^+$:

(a) Weighting methods incorporate the gauge conditions $c(x^+)$ as extra
“virtual” observations to be integrated with the actual measurements.

(b) Projection methods project the state space into a smaller space in which
the gauge conditions are approximately satisfied; solve for the update in the
reduced state space; finally project the solution back into the original
space.

For brevity we shall consider the latter technique only.

2. Pseudo-Inverse methods attack the matrix equation (8) directly using a
form of pseudo-inverse applied to $A$ to solve for $\Delta x$. This method has
been suggested by Morris & Kanatani [11], and possesses the same GI properties
as our proposed gauge fixing method. However it is somewhat slower and
less flexible.

3. Elimination methods eliminate the necessary number of state parameters
by setting them to special values. The remaining parameters can then be
solved for directly. Such a procedure is also referred to as selecting a canonical
frame [?], or a normal form [?]. This method can be used to eliminate
the effect of coordinate frame ambiguity, simply by fixing the same number
of parameters as there are gauge freedoms, but in the context of 3D
reconstruction from multiple images this procedure introduces a new dependence
on the order of the images.



3.2 Gauge Fixing 

Enforcing a chosen gauge $c(.)$ by linearising $c(.)$ about the latest solution $x^-$
gives us the system of equations

$$ A \, \Delta x = a, \qquad C \, \Delta x = 0, \qquad \text{where } C = \frac{\partial c}{\partial x}. \tag{9} $$

The equation $C \Delta x = 0$ ensures that the gauge conditions will be enforced to
first order on the solution vector $x^+$. In order for the constraint to entirely
remove the gauge freedoms, it is clear that the rows of $C$ must span the
null-space of $A$, although the two linear spaces do not have to be equal.
Equation (9) is a standard linear system with equality constraints, and [?]
suggest the following method of solution:

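To apply the alternative “projection” method of gauge fixing we first write
the Singular Value Decomposition (SVD) of the constraint Jacobian matrix
$C = \partial c/\partial x$ as

$$ C = U W V^T = U \begin{pmatrix} W_1 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}. \tag{10} $$

If there are $r$ gauge conditions, the size of $W_1$ is $r \times r$. Now define a
transformed state vector

$$ \begin{pmatrix} y' \\ y \end{pmatrix} = V^T \Delta x = \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} \Delta x. $$

If we convert the equations (9) into the transformed state space $y$, $y'$, we obtain

$$ A \begin{pmatrix} V_1 & V_2 \end{pmatrix} \begin{pmatrix} y' \\ y \end{pmatrix} = a, \qquad C \begin{pmatrix} V_1 & V_2 \end{pmatrix} \begin{pmatrix} y' \\ y \end{pmatrix} = 0. $$

The latter equation simplifies to

$$ U W_1 y' = 0 \quad \text{or} \quad y' = 0, $$

enforcing the gauge conditions $C \Delta x = 0$. The main equation now becomes

$$ A V_2 y = a, $$

from which, premultiplying by $V_2^T$, we can obtain the solutions for $y$ and $x^+$:

$$ y = (V_2^T A V_2)^{-1} V_2^T a, \qquad x^+ = x^- + V_2 y. \tag{11} $$

Thus this method of solution involves projecting $x$ into the smaller state space
$y$, in which the gauge conditions are enforced to first order, solving the
unconstrained linear system in that space and projecting back into $x$ space to
obtain the final solution (11).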
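The projection solution (11) can be sketched in a few lines of NumPy. The function name `gauge_fixed_step` and the toy system (five parameters sharing one common-shift gauge direction) are hypothetical choices made for illustration only. The sketch also compares the result with a pseudo-inverse solution for the special case where the rows of $C$ span exactly the null-space of $A$.

```python
import numpy as np

def gauge_fixed_step(A, a, C):
    """Solve A dx = a subject to C dx = 0 by projecting onto the null space of C."""
    r = np.linalg.matrix_rank(C)      # number of independent gauge conditions
    _, _, Vt = np.linalg.svd(C)       # C = U W V^T; Vt has shape (n, n)
    V2 = Vt[r:].T                     # columns of V2 span {dx : C dx = 0}
    y = np.linalg.solve(V2.T @ A @ V2, V2.T @ a)
    return V2 @ y

# Hypothetical toy system: 5 parameters, one gauge direction (a common shift).
rng = np.random.default_rng(1)
n = 5
gauge = np.ones((n, 1)) / np.sqrt(n)  # null-space vector of A
P = np.eye(n) - gauge @ gauge.T       # projector orthogonal to the gauge direction
B = rng.standard_normal((8, n)) @ P
A = B.T @ B                           # information matrix of rank n - 1
a = A @ rng.standard_normal(n)        # consistent right-hand side
C = gauge.T                           # single gauge condition on dx

dx = gauge_fixed_step(A, a, C)
dx_pinv = np.linalg.pinv(A, rcond=1e-10) @ a
```

In this special case the two updates agree, which is one concrete instance of the earlier remark that the two methods can be made equivalent with a suitable choice of constraints.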



Gauge independence of projection method. If we hypothesise that coordinate
frame changes are always linear orthogonal,

$$ y = \phi x \tag{12} $$

for orthogonal matrix $\phi$, and also that the constraint function $c(.)$ transforms
similarly as

$$ c(y) = \theta c(x) \tag{13} $$

for orthogonal matrix $\theta$, then we can demonstrate gauge independence. Firstly
we write the SVD of the constraint Jacobian $C'$ in the transformed frame as

$$ C' = U' W' V'^T = \theta C \phi^{-1} = \theta U W V^T \phi^{-1} = (\theta U) W (\phi V)^T $$

and so we have $V_2' = \phi V_2$. Also the innovation vectors are

$$ \nu'(j) = z(j) - h(j; y^-) = \nu(j), $$

because of the pure gauge freedom assumption (2). The measurement Jacobians
$H'(j)$ in the transformed space are related to $H(j)$ as

$$ H'(j) = \frac{\partial h(j)}{\partial y} = \frac{\partial h(j)}{\partial x} \frac{\partial x}{\partial y} = H(j) \phi^{-1} = H(j) \phi^T. $$

Thus the information matrix $A$ and RHS vector $a$ transform to

$$ A' = \phi A \phi^T, \qquad a' = \phi a. $$

This leads to the update rule in the transformed frame:-

$$ y^+ = y^- + V_2' (V_2'^T A' V_2')^{-1} V_2'^T a' = y^- + \phi V_2 (V_2^T A V_2)^{-1} V_2^T a = y^- + \phi (x^+ - x^-) \tag{14} $$

which is gauge-independent in the sense of preserving the same $\phi$ between the
solutions before and after the iteration, given that $y^- = \phi x^-$, thus
satisfying the main GI criterion (4).

The criteria (12) and (13) can be applied to specific models in order to select
the state representation and constraint functions in specific scenarios, which is 
what we do in later sections for the structure-from-motion problem. 
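The transformation property (14) can be checked numerically. The sketch below is an assumption-laden illustration: it draws random orthogonal matrices for $\phi$ and $\theta$, transforms a hypothetical rank-deficient system accordingly, and compares the two projection updates. The helper names are hypothetical.

```python
import numpy as np

def gauge_fixed_step(A, a, C):
    """Projection update: dx = V2 (V2^T A V2)^-1 V2^T a, with V2 spanning null(C)."""
    r = np.linalg.matrix_rank(C)
    _, _, Vt = np.linalg.svd(C)
    V2 = Vt[r:].T
    return V2 @ np.linalg.solve(V2.T @ A @ V2, V2.T @ a)

def random_orthogonal(rng, n):
    """Random n x n orthogonal matrix from a QR factorisation."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

rng = np.random.default_rng(2)
n, r = 6, 2
N, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal gauge directions
P = np.eye(n) - N @ N.T
B = rng.standard_normal((10, n)) @ P
A = B.T @ B                                       # rank n - r information matrix
a = A @ rng.standard_normal(n)
C = N.T                                           # r gauge conditions

phi = random_orthogonal(rng, n)                   # frame change y = phi x
theta = random_orthogonal(rng, r)                 # constraint change c(y) = theta c(x)
A_p, a_p, C_p = phi @ A @ phi.T, phi @ a, theta @ C @ phi.T

dx = gauge_fixed_step(A, a, C)
dx_p = gauge_fixed_step(A_p, a_p, C_p)            # update computed in the new frame
```

The SVD in the transformed frame may return a different basis for the null space than $\phi V_2$, but the product $V_2 (V_2^T A V_2)^{-1} V_2^T$ depends only on the subspace, so the comparison is still meaningful.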

3.3 Coordinate Frame Normalisation 

The third strand of our method of dealing with gauge freedoms is to normalise 
the coordinate frame of our estimated parameters before applying Gauss-Newton 
iterations. We select a coordinate frame among the space of frames that agrees 
with pre-specified normalisation conditions. There is a strong link with the gauge 
fixing conditions discussed, since the most natural choice of gauge fixing condi- 
tions is then to re-impose the same conditions as were used to select the initial 
coordinate frame. 

The intended effect of coordinate frame normalisation is illustrated in figure 1.
As before we consider a general change of coordinates $y = g(x)$ relating two
equivalent parameter vector estimates. The idea of normalisation is to reduce
the effective space of coordinate frames, ideally to the space of (possibly scaled)
linear orthogonal transformations. If this can be achieved, we can proceed to
use the GI Gauss-Newton methods described above. We assume in figure 1 that
no scale factor $\beta$ remains after normalisation, because in the problems we have
looked at we have been able to remove any scale factors, including, perhaps
surprisingly, projective 3D reconstruction.

[Figure 1: diagram with arrows labelled “general g(x)”, “normalisation of x”,
“normalisation of y” and “orthogonal $\phi$”.]

Fig. 1. Illustration of coordinate frame normalisation. Applying normalisation to two
different parameter vectors $x$, $y$ related by a general coordinate frame transformation
$y = g(x)$ reduces the space of transformations after normalisation of both to the space
of orthogonal transformations $y' = \phi x'$ for orthogonal matrix $\phi$.

In the following sections we shall treat each different type of gauge freedom
separately in demonstrating satisfaction of the GI criteria. This is a valid
approach because when considered in combination we obtain a product of the
orthogonal transformations corresponding to all the gauge freedoms together,
which is another orthogonal transformation. Hence if we prove that the GI
criteria are satisfied for all sub-parts of our representation, we automatically
have GI for the whole system.

4 Perspective Projection Model 

The projection model corresponding to Euclidean SFM is the perspective
model [?]. It is the most detailed model and applies when the camera calibration
is known. We can write perspective projection for 3D points $X = (X \; Y \; Z)^T$ as

$$ p = \lambda K R (X - T) \tag{15} $$

where

— $p$ is the projected image point in homogeneous coordinates;

— $K$ is the matrix of calibration parameters, defined by the focal lengths $f_x$,
$f_y$, the image centre coordinates $x_0$, $y_0$ and skew parameter a;

— $R$ is a $3 \times 3$ rotation matrix;

— $T$ is a translation vector in scene coordinates, the camera position.

4.1 Observation Model 

To construct a measurement vector for use in Gauss-Newton iterations, $p$ is
converted to non-homogeneous form by dividing through by $z$. Then the
measurement vector $z_i(j)$ is the image feature position for scene feature $i$ in
image $j$, for instance a corner feature.

4.2 Representing Rotations 

Representing a 3D rotation is problematic in the context of gauge invariance.
Angle representations do not have the GI properties, because they clearly do
not transform according to the GI criterion (12). There is also the problem that
angle representations, and indeed any non-redundant representations of rotation,
have singularities near which small changes in rotation produce uncontrollably
large changes in the rotation parameters. There are however two good choices,
one redundant and one non-redundant. Quaternions are a good choice, and can
provide the solutions to many problems in vision, such as computing the camera
rotation between images [?] or camera pose estimation [7]. However they have
the problem that quaternions are a redundant representation, having a scale
freedom. This is easy to handle using the gauge methods employed in this paper,
but we prefer the local rotation approach, employed in 3D reconstruction by
Taylor & Kriegman [?]. A local rotation representation has also been developed
independently by Pennec & Thirion [?]. The idea is that if an estimate $R_0$ of
the rotation is available, one can factorise a rotation $R$ as

$$ R = R_s R_0 \tag{16} $$

where $R_s$ is a small rotation. Because $R_s$ is small, we can use a representation
that is specialised to small rotations. The exponential representation is ideal for
this purpose. This represents a rotation using a 3-vector $r$ by the Rodrigues
formula

$$ R_s = I_{3 \times 3} + \frac{\sin\theta}{\theta} [r]_\times + \frac{1 - \cos\theta}{\theta^2} \left( r r^T - \|r\|^2 I_{3 \times 3} \right) \tag{17} $$

where $\theta = \|r\|$ is the rotation angle, and $[r]_\times$ is the “cross-product”
matrix for $r$. In an iterative algorithm, the small rotation change $R_s$ is
compounded with the previous value $R_0$ to form the new rotation estimate
$R_s R_0$ to be fed in as the new $R_0$ at the next iteration.
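A minimal sketch of the exponential representation (17) and the compounding rule (16) follows. The function names are hypothetical, and the small-angle branch uses Taylor expansions of the two coefficients, which is an implementation choice not discussed in the text.

```python
import numpy as np

def cross_matrix(r):
    """Return the cross-product matrix [r]_x such that [r]_x v = r x v."""
    x, y, z = r
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def small_rotation(r):
    """Rotation matrix for rotation vector r, following the Rodrigues formula (17)."""
    r = np.asarray(r, dtype=float)
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        # Series limits of sin(theta)/theta and (1 - cos(theta))/theta^2.
        s = 1.0 - theta**2 / 6.0
        c = 0.5 - theta**2 / 24.0
    else:
        s = np.sin(theta) / theta
        c = (1.0 - np.cos(theta)) / theta**2
    return np.eye(3) + s * cross_matrix(r) + c * (np.outer(r, r) - theta**2 * np.eye(3))

# A quarter turn about the z axis maps the x axis onto the y axis.
R_quarter = small_rotation([0.0, 0.0, np.pi / 2])

# Compounding as in (16): the updated reference rotation is R_s R_0,
# after which the local parameters r can be reset to zero.
R0 = R_quarter
R0_new = small_rotation([0.01, -0.02, 0.005]) @ R0
```

Resetting `r` to zero after each compounding step keeps the local parameters far from the singularities of the exponential map.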



4.3 Euclidean Ambiguities and Constraints 

The global coordinate frame ambiguity for the Euclidean case is a 3D similarity
transformation, as we can see by rewriting (15) as

$$ p = \lambda K \left( R \mid -RT \right) \begin{pmatrix} X \\ 1 \end{pmatrix} = \lambda' K \left( R' \mid -R'T' \right) \begin{pmatrix} X' \\ 1 \end{pmatrix} $$

where $R_H$ is a $3 \times 3$ rotation matrix, $T_H$ a translation vector, $h$ a scale
factor, and $R' = R R_H^T$, $T' = h^{-1}(R_H T + T_H)$, $X' = h^{-1}(R_H X + T_H)$
and $\lambda' = h \lambda$. The Euclidean coordinate frame ambiguity is represented
by $R_H$, $T_H$ and $h$. This provides 7 degrees of freedom in setting up the
global coordinate frame.
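The similarity ambiguity can be verified numerically. The sketch below assumes the projection model $p = \lambda K R (X - T)$ with $T$ the camera position, and uses hypothetical calibration values and random scene data; it checks that the non-homogeneous image points are unchanged when the frame is transformed by $R_H$, $T_H$ and $h$.

```python
import numpy as np

def project(K, R, T, X):
    """Project the rows of X (n x 3) with p = K R (X - T), then divide by z."""
    p = (X - T) @ (K @ R).T
    return p[:, :2] / p[:, 2:3]

def random_rotation(rng):
    """Random proper rotation from a QR factorisation (determinant forced to +1)."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(3)
K = np.array([[800.0, 0.0, 320.0],      # hypothetical calibration matrix
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([0.0, 0.0, -5.0])          # camera five units behind the scene
X = rng.uniform(-1.0, 1.0, size=(15, 3))

# Similarity change of frame: rotation R_H, translation T_H, scale h.
R_H, T_H, h = random_rotation(rng), rng.standard_normal(3), 2.5
R_new = R @ R_H.T
T_new = (R_H @ T + T_H) / h
X_new = (X @ R_H.T + T_H) / h

z_img = project(K, R, T, X)
z_img_new = project(K, R_new, T_new, X_new)
```

The homogeneous vectors differ by the factor $1/h$, which cancels in the division by the third coordinate.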

4.4 Euclidean Normalisation 

We first normalise the coordinate frame using constraints on the translation 
vectors (three constraints) and overall scale (one). No normalisation needs to be 
applied to the rotations. Then we introduce a constraint function c(.) such that 
the rows of its Jacobian C are linear combinations of the null-space vectors of 
A, thus guaranteeing elimination of all seven gauge freedoms. 






The coordinate frame normalisation subtracts the centroid of the camera
translation vectors from each translation vector, and then scales the translation
vectors so that their average squared length is unity. Given that the original
centroid of the translations is $\bar{T}$, and writing

$$ s^2 = \frac{1}{k} \sum_{j=1}^{k} \| T(j) - \bar{T} \|^2, $$

averaging over the $k$ images, we need to make the following modifications to all
the parameters:

$$ T(j) \leftarrow \frac{1}{s} (T(j) - \bar{T}), \qquad X \leftarrow \frac{1}{s} (X - \bar{T}) \ \text{for all 3D points}. $$

Making these modifications together ensures that the modified reconstruction
is equivalent to the original. After the normalisation, the centroid of the
translations is fixed at zero and the scale set to unity. The normalisation leaves
only the rotations unchanged. Now under a rotation $R_H$ of the initial coordinate
frame as above, normalising both coordinate frames as in figure 1, the rotations
and translations in the two frames are related after normalisation as

$$ R'(j) = R(j) R_H^T, \qquad T'(j) = R_H T(j). $$

Since the reference rotations will be reset to the normalised rotations in both
cases, both sets of local rotation parameters will be set to zero. This means that
the motion parameters after normalisation will have the relationship

$$ \begin{pmatrix} r'(j) \\ T'(j) \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & R_H \end{pmatrix} \begin{pmatrix} r(j) \\ T(j) \end{pmatrix}, \tag{18} $$

an orthogonal change of coordinates, agreeing with the GI criterion (12).
Similarly the 3D point features are normalised in the two coordinate frames to
$X$ and $X'$, which will be related as $X' = R_H X$, again agreeing with (12). So our
normalisation procedure has successfully achieved its aim of reducing the
coordinate frame freedom to a combination of orthogonal transformations.
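The normalisation step can be sketched as follows, assuming the camera positions and 3D points are stored as rows of NumPy arrays. The function name `normalise_frame` and the random test data are hypothetical. The sketch also checks the claim that two reconstructions related by a general similarity transformation differ only by the rotation $R_H$ once both have been normalised.

```python
import numpy as np

def normalise_frame(T, X):
    """Centre on the centroid of the camera positions T (k x 3) and rescale so
    that the mean squared length of the centred positions is one; the points
    X (n x 3) receive the same shift and scale."""
    T_bar = T.mean(axis=0)
    s = np.sqrt(np.mean(np.sum((T - T_bar) ** 2, axis=1)))
    return (T - T_bar) / s, (X - T_bar) / s

rng = np.random.default_rng(4)
T = rng.standard_normal((7, 3))          # seven camera positions
X = rng.standard_normal((15, 3))         # fifteen scene points

# A second reconstruction related to the first by a similarity transformation.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R_H = Q * np.sign(np.linalg.det(Q))      # force a proper rotation
T_H, h = rng.standard_normal(3), 3.0
T2 = (T @ R_H.T + T_H) / h
X2 = (X @ R_H.T + T_H) / h

Tn, Xn = normalise_frame(T, X)
T2n, X2n = normalise_frame(T2, X2)
```

After normalisation the translation $T_H$ and scale $h$ have dropped out, leaving `T2n` equal to `Tn` rotated by `R_H`, and likewise for the points.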



4.5 Euclidean Gauge Conditions 



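Having normalised the coordinate frame, we separate the seven gauge conditions
into three rotation conditions $c_r(.)$, three translation $c_t(.)$ and one scale
$c_s(.)$, as follows:

$$ c = \begin{pmatrix} c_r \\ c_t \\ c_s \end{pmatrix} = \sum_{j=1}^{k} \begin{pmatrix} r(j) + [\tilde{T}(j)]_\times T(j) \\ T(j) \\ \tfrac{1}{2} \| T(j) \|^2 \end{pmatrix}. $$

We include the vectors $\tilde{T}(j)$ with the convention that they are represented as
constant vectors in $c(.)$, equal to $T(j)$ but not considered as variables. Thus the
Jacobian of $c(.)$ is

$$ C = \begin{pmatrix}
\frac{\partial c_r}{\partial r(1)} & \frac{\partial c_r}{\partial T(1)} & \cdots & \frac{\partial c_r}{\partial r(k)} & \frac{\partial c_r}{\partial T(k)} \\
\frac{\partial c_t}{\partial r(1)} & \frac{\partial c_t}{\partial T(1)} & \cdots & \frac{\partial c_t}{\partial r(k)} & \frac{\partial c_t}{\partial T(k)} \\
\frac{\partial c_s}{\partial r(1)} & \frac{\partial c_s}{\partial T(1)} & \cdots & \frac{\partial c_s}{\partial r(k)} & \frac{\partial c_s}{\partial T(k)}
\end{pmatrix}
= \begin{pmatrix}
I & [\tilde{T}(1)]_\times & \cdots & I & [\tilde{T}(k)]_\times \\
0 & I & \cdots & 0 & I \\
0^T & T(1)^T & \cdots & 0^T & T(k)^T
\end{pmatrix}. $$

To demonstrate agreement with the GI criterion (13), we use the relation (18),
obtaining under the transformed normalised coordinate frame,

$$ c_r' = \sum_{j=1}^{k} \left( r'(j) + [\tilde{T}'(j)]_\times T'(j) \right) = R_H c_r, $$

where the last step uses the fact that the local rotation parameters are zero
after normalisation. Trivially we also have $c_t' = R_H c_t$ and finally $c_s' = c_s$.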



4.6 Translation Represented in Camera Frame 

We now discuss an alternative model whereby we represent the translation vector
in camera coordinates. In other words the projection model (15) converts to
$p = \lambda K (R X + T)$. We can follow through the previous model, and we find that
the results do not agree with the GI criteria (details omitted).

5 Algorithm Description 

1. Start with a prior estimate $x^-$ for the state parameters $x$. Some of them may
be provided in advance, for instance camera calibration parameters where
applicable. Others may be given initial estimates directly from the
observations. In 3D reconstruction, the multi-view tensorial approach is the
latest way to generate motion and structure parameters directly from images [?].
The first step is to normalise the coordinate frame as specified by any
normalisation conditions (section 3.3). Then if there are gauge fixing constraints
to be enforced, these must be enforced prior to starting the iterations. They
are then re-enforced between each iteration, as described below.

2. Build the linear system (8) by linearising the measurement equations for each
observation around the latest state parameters $x^-$. For the 3D reconstruction
problem, these are the non-homogeneous versions of the point projection
equation (15). If gauge constraints are to be enforced to first order by the
weighting or projection methods, the linear system is adjusted to (9). When
the state vector $x$ is partitioned into blocks, constraints can be applied
separately to single blocks or combinations of blocks, and arbitrary mixtures of
different constraint types are allowed for each constraint.

3. Perform a Gauss-Newton iteration (11), to produce a new estimate $x^+$ for the
state parameters. We actually use Levenberg-Marquardt iterations, which
are basic Gauss-Newton iterations with added damping.

4. Re-impose any normalisation and gauge fixing conditions by manipulation 
of the state vector x+. Because this only corrects internal gauge freedoms, 
the error function J(x+) is not affected. 

5. Other internal degrees of freedom that are not gauge freedoms may be reset.
For instance, in the case of Euclidean reconstruction, the reference rotations
$R_0(j)$ are adjusted to $R_s(j) R_0(j)$, where $R_s(j)$ is the small rotation formed
from the updated rotation parameters $r(j)$. This allows $r(j)$ to be reset to
zero. All such changes are fed back to the state vector $x^+$.

6. If the error function $J(x^+)$ in (6) has decreased from $J(x^-)$, replace $x^-$
with the updated and adjusted state vector $x^+$, and decrease the
Levenberg-Marquardt damping factor. Otherwise increase the damping factor and
make no change to $x^-$.

7. If a termination criterion has been reached (e.g. limit on number of iterations 
or on the size of a decrease in J), exit. Otherwise loop back to step 2 and 
perform another iteration. 
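The control flow of steps 1 to 7 can be sketched on a deliberately small problem. Everything specific in the block below is a hypothetical stand-in rather than the reconstruction algorithm itself: the state is a vector of 1D positions observed through a non-linear function of pairwise differences, the only gauge freedom is a common shift, normalisation subtracts the mean, the measurement covariance is the identity, and the damping schedule (factors of ten) is an arbitrary choice.

```python
import numpy as np

def h(x, pairs):
    """Toy observation function: a non-linear function of pairwise differences."""
    d = x[pairs[:, 0]] - x[pairs[:, 1]]
    return d + 0.1 * d**3

def jacobian(x, pairs):
    """Jacobian of h with respect to x."""
    d = x[pairs[:, 0]] - x[pairs[:, 1]]
    g = 1.0 + 0.3 * d**2
    H = np.zeros((len(pairs), len(x)))
    rows = np.arange(len(pairs))
    H[rows, pairs[:, 0]] = g
    H[rows, pairs[:, 1]] = -g
    return H

def normalise(x):
    """Remove the shift gauge freedom by fixing the centroid at zero."""
    return x - x.mean()

def lm_gauge_fixed(z, pairs, x0, max_iter=50, tol=1e-12):
    n = len(x0)
    C = np.ones((1, n))                  # gauge condition: the update has zero sum
    _, _, Vt = np.linalg.svd(C)
    V2 = Vt[1:].T                        # basis of the null space of C
    x = normalise(x0)                    # step 1: normalise the initial frame
    J = np.sum((z - h(x, pairs)) ** 2)
    mu = 1e-3                            # damping factor
    for _ in range(max_iter):
        H = jacobian(x, pairs)           # step 2: linearise about x^-
        nu = z - h(x, pairs)
        A, a = H.T @ H, H.T @ nu
        # Step 3: damped update solved in the projected space.
        y = np.linalg.solve(V2.T @ (A + mu * np.eye(n)) @ V2, V2.T @ a)
        x_new = normalise(x + V2 @ y)    # step 4: re-impose the normalisation
        J_new = np.sum((z - h(x_new, pairs)) ** 2)
        if J_new < J:                    # step 6: accept and reduce damping
            converged = (J - J_new) < tol
            x, J, mu = x_new, J_new, mu / 10.0
            if converged:                # step 7: termination test
                break
        else:                            # step 6: reject and increase damping
            mu *= 10.0
    return x, J

rng = np.random.default_rng(5)
n = 5
pairs = np.array([(i, j) for i in range(n) for j in range(i + 1, n)])
x_true = rng.standard_normal(n)
z = h(x_true, pairs)                     # noise-free measurements
x0 = x_true + 0.1 * rng.standard_normal(n)

x_a, J_a = lm_gauge_fixed(z, pairs, x0)
x_b, J_b = lm_gauge_fixed(z, pairs, x0 + 7.0)   # same start in a shifted frame
```

The two runs start from estimates that differ only by a gauge shift and end at the same normalised solution, which is the behaviour the gauge independence criterion asks for.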

6 Results 

In figure 2 we show some preliminary results on a test set of seven simulated
images of fifteen 3D points. To make the reconstruction difficult, the points were
placed so that they almost lie in a single plane, a critical surface for 3D
reconstruction. This allows the advantages of imposing gauge conditions to be
made apparent. The convergence of the algorithm on this test set is quicker than
that of the other algorithms on test, which are:-

— Camera-centred translation (Camera T-1 and Camera T-2). These results
for the camera-centred translation vector representation were obtained for
different initial coordinate frames, and show that the convergence rate in
this case is affected by coordinate frame choice. The total time taken by this
version was 1.60s on a 233MHz PC.

— Free Gauge is the method advocated in [?], whereby Levenberg-Marquardt
damping is used to deal with the gauge freedoms. In well-conditioned
situations this method is comparable in performance with imposing explicit
gauge conditions, but as we see here, when the conditioning is quite bad,
having gauge conditions helps because Levenberg-Marquardt is left to deal
with the conditioning of the system, which in the free gauge algorithm is
“masked” by the gauge freedoms. Total time: 1.63s.

— Pseudo-inverse is the method of taking the pseudo-inverse of the information
matrix used by photogrammetrists [?] and suggested by Morris & Kanatani [11],
although adjusted to firstly eliminate the structure parameters from the
system to improve the performance. Total time: 11.91s.

— Gauge Conditions is the gauge condition method. Total time: 10.95s.

Fig. 2. Results showing convergence rate for a test set of simulated data, testing various
types of bundle adjustment (see text).

In other experiments, not shown here because of lack of space, we have shown 
that the convergence rates for a well-conditioned 3D Euclidean reconstruction 
problem are comparable for all these methods, and others we have tried. It is 
only when the conditioning is bad that the gauge condition method is likely to 
show great advantages, and it is considerably slower than the free-gauge
algorithm, because of the extra expense of factorizing the gauge condition matrix.
Nevertheless where reliability is important, the explicit gauge condition method 
should be considered, and in any case our discussions of representation and co- 
ordinate frame normalisation still apply. The speed difference will reduce if the 
ratio of features to images is increased. 

7 Conclusions 

We have developed the theory of gauge independence in some detail for the 
problem of 3D scene reconstruction. The methods are applicable to many op- 
timisation problems having internal gauge freedoms, especially those resulting 
in a sparse information matrix structure. We have omitted the generalisation of 
the method to reconstructions of lines and other projection models; they will 
appear in a longer treatment. 
Discussion

Kenichi Kanatani: As you mentioned, gauge freedom is something that occurs 
in our description of the problem. It is nothing to do with the real world, the 
observations, or the noise. Changing the gauge doesn’t affect these, the final 
results should be the same or at least geometrically equivalent. 

Philip McLauchlan: Yes, that’s correct. 

Kenichi Kanatani: But you showed some convergence graphs where the fi- 
nal results seem to be different for different gauge fixing methods. Apart from 
computational efficiency, do the results depend on the gauge or not? 

Philip McLauchlan: They are all converging to the same answer — if you 
leave them to run long enough they do converge to the same global minimum. 
But the speed at which they do it, and whether they do it at different rates for 
completely arbitrary reasons like changing the coordinate frame, depends a lot 
on the gauge fixing method. 

Kenichi Kanatani: So there are two issues for the choice, speed of convergence 
and computational efficiency, like fill-in problems. 

Philip McLauchlan: Yes, that’s right. 

Rick Szeliski: This is a clarification question. In addition to renormalizing at 
each step to keep the centroid at zero and the scale at unity, do you also impose 
this as a gauge condition so that the delta, the change in movement has zero 
centroid and there is no change in scale? 

Philip McLauchlan: That is correct. 

Rick Szeliski: You also made a comment that by eliminating either the struc- 
ture or the motion you can get more efficient. Does that just depend on the 
problem, whether there are more frames or more points? 

Philip McLauchlan: That’s right. Usually these problems have a large number
of features and a relatively small number of images, so it makes sense to
eliminate the structure parameters first because that is fast, and then to solve 
for the motion and then back-substitute to obtain the structure, whereas the 
photogrammetrists’ main solution involves doing it the other way round. They 
argue that they can do this in an approximate way which doesn’t give them a 
big performance problem, but it seems like the whole reason for doing this is 
flawed, and it makes more sense to me that we should aim for the more efficient 
solution which has no disadvantages. 

Joss Knight: Does the gauge conditioning help at all with problems of local 
minima in the minimization? 

Philip McLauchlan: No, not really. The different versions all assume that the 
initial solution is close enough to the global minimum to avoid the local minima. 

Abstract. The parameters estimated by Structure from Motion (SFM)
contain inherent indeterminacies which we call gauge freedoms. Under 
a perspective camera, shape and motion parameters are only recovered 
up to an unknown similarity transformation. In this paper we investi- 
gate how covariance-based uncertainty can be represented under these 
gauge freedoms. Past work on uncertainty modeling has implicitly im- 
posed gauge constraints on the solution before considering covariance 
estimation. Here we examine the effect of selecting a particular gauge on 
the uncertainty of parameters. We show potentially dramatic effects of 
gauge choice on parameter uncertainties. However the inherent geomet- 
ric uncertainty remains the same irrespective of gauge choice. We derive 
a Geometric Equivalence Relationship with which covariances under dif- 
ferent parametrizations and gauges can be compared, based on their true 
geometric uncertainty. We show that the uncertainty of gauge invariants 
exactly captures the geometric uncertainty of the solution, and hence 
provides useful measures for evaluating the uncertainty of the solution. 
Finally we propose a fast method for covariance estimation and show its 
correctness using the Geometric Equivalence Relationship. 



1 Introduction 



It is well known that, for accurate 3D reconstruction from image sequences,
statistically optimal results are obtained by bundle adjustment. This is just
Maximum Likelihood estimation for independent, isotropic Gaussian noise, and
is also used by photogrammetrists. Current research generally focuses on two
areas: (1) simplicity of solution, which includes finding closed-form
approximate solutions such as the Factorization method, and (2) efficiency,
which includes finding fast or robust numerical schemes.

An important third area to address is the quantitative assessment of the
reliability of the solution. While some work has incorporated uncertainty
analyses of the results, none has investigated the effect of parameter
indeterminacies on the uncertainty modeling. These indeterminacies are inherent
to SFM and have a significant effect on parameter uncertainties. Our goal is to create
a framework for describing the uncertainties and indeterminacies of parameters 
used in Structure from Motion (SFM). We can then determine how both these 
uncertainties and indeterminacies affect the real geometric measurements recov- 
ered by SFM. 

The standard measure for uncertainty is the covariance matrix. However 
in SFM there is a uniqueness problem for the solution and its variance due to 
inherent indeterminacies: the estimated object feature positions and motions are 
only determined up to an overall translation, rotation and scaling. Constraining
these global quantities we call choosing a gauge. Typically a covariance matrix 
describes the second order moments of a perturbation around a unique solution. 
In past work, indeterminacies are removed by choosing an arbitrary
gauge, and then the optimization is performed under these gauge constraints 
and the recovered shape and motion parameters along with their variances are 
expressed in this gauge. 

In this paper we provide an analysis of the effects of indeterminacies and 
gauges on covariance-based uncertainty models. While the choice of gauge can 
dramatically affect the magnitude and values in a covariance matrix, we show 
that these effects are superficial and the underlying geometric uncertainty is 
unaffected. To show this we derive a Geometric Equivalence Relationship be- 
tween the covariance matrices of the parameters that depends only on the es- 
sential geometric component in the covariances. Hence we are able to propose 
a covariance-based description of parameter uncertainties that does not require 
gauge constraints. Furthermore we show how this parametric uncertainty model 
can be then used to obtain an uncertainty model for actual geometric properties 
of the shape and motion which are gauge-invariant. Optimization is achieved 
in an efficient free-gauge manner and we propose a fast method for obtaining 
covariance estimates when there are indeterminacies. 



2 Geometric Modeling 

2.1 Camera Equations 

Here we describe an object and camera system in a camera-centered coordinate
system. Analogous equations could be derived in other coordinate systems.
Suppose we track $N$ rigidly moving feature points $p_\alpha$, $\alpha = 1, \ldots, N$,
in $M$ images. Let $p_{\kappa\alpha}$ be the 2-element image coordinates of $p_\alpha$
in the $\kappa$th image. We identify the camera coordinate system with the $XYZ$
world coordinate system, and choose an object coordinate system in the object.
Let $t_\kappa$ be the origin of the object coordinate system in the $\kappa$th image,
$R_\kappa$ be a $3 \times 3$ rotation matrix which specifies its orientation, and
$s_\alpha$ be the coordinates of the feature point $p_\alpha$ in the object coordinate
system. Thus the position of feature point $p_\alpha$ with respect to the camera
coordinate system in the $\kappa$th image is $R_\kappa s_\alpha + t_\kappa$.

Assume we have a projection operator $\Pi : \mathbb{R}^3 \to \mathbb{R}^2$ which projects a
point in 3D to the 2D image plane. We can then express the image coordinates of
feature $p_\alpha$ as:

$$p_{\kappa\alpha} = \Pi[K_\kappa(R_\kappa s_\alpha + t_\kappa)] \qquad (1)$$

where $K_\kappa$ is a $3 \times 3$ internal camera parameter matrix [2] containing quantities
such as focal length for each image. While these parameters can be estimated
along with shape and motion parameters, for simplicity we ignore them in the
rest of the paper and assume $K_\kappa$ is just the identity matrix for orthography
and $\mathrm{diag}(f, f, 1)$ for perspective projection with focal length $f$. Various camera
models can be defined by specifying the action of this projection operator on a
vector $(X, Y, Z)^\top$. For example, we define the projection operators for
orthography and perspective projection respectively in the following way:

$$\Pi_o[(X, Y, Z)^\top] = (X, Y)^\top, \qquad \Pi_p[(X, Y, Z)^\top] = (X/Z,\; Y/Z)^\top. \qquad (2)$$



Equation (1) can be applied to all features in all images, and then combined in
the form:

$$p = \Pi(\theta) \qquad (3)$$

where $p = (p_{11}^\top, p_{12}^\top, p_{13}^\top, \ldots, p_{MN}^\top)^\top$ is a vector containing all the image feature
coordinates in all images, $\theta$ is a vector containing the shape and motion
parameters, $R_\kappa$, $s_\alpha$, $t_\kappa$, and possibly unknown internal camera parameters, for
all object features and images, and $\Pi$ is the appropriate combination of the
projection matrices. More details can be found in 0. 



2.2 Parameter Constraints 

Not all of the parameters in $\theta$ are independent and some need to be constrained.
In particular the columns of each rotation matrix, $R_\kappa$, must remain unit
orthogonal vectors. Small perturbations of rotations are parametrized by a 3-vector
which, to first order, maintains the rotation properties [3]. Let $\mathcal{T}$ be the
manifold of valid vectors $\theta$, such that all solutions for $\theta$ lie in $\mathcal{T}$. $\mathcal{T}$ will be a
manifold of dimension $n$, where $n$ is the number of parameters needed to locally
specify the shape and motion: 3 for each rotation, 3 for each translation, and 3 for
each 3D feature point, plus any internal camera parameters that must be estimated.
So in general, for just motion and shape, the number of unknown parameters is
$n = 3N + 6M$.



2.3 Indeterminacies 

The camera equations (1) and (3) contain a number of indeterminacies. There
are two reasons for these indeterminacies: first, the object coordinate system can
be selected arbitrarily, and second, the projection model maps many 3D points
to a single 2D point. These are specified as follows:

Coordinate System Indeterminacies

If we rotate and then translate the coordinate system by $R$ and $t$ respectively,
we obtain the following transformed shape and motion parameters:

$$s'_\alpha = R^\top(s_\alpha - t), \qquad R'_\kappa = R_\kappa R, \qquad t'_\kappa = R_\kappa t + t_\kappa. \qquad (4)$$

We note that $R'_\kappa s'_\alpha + t'_\kappa = R_\kappa s_\alpha + t_\kappa$, and hence, irrespective of the projection
model, equations (1) and (3) must be ambiguous to changes in coordinates.

Projection Indeterminacies

Many different geometric solutions project onto the same points in the image.
In orthography the depth or Z component does not affect the image, and hence
the projection is invariant to the transformation:

$$t'_\kappa = t_\kappa + d_\kappa k \qquad (5)$$

for any value $d_\kappa$, where $k$ is the unit vector along the Z axis. Orthography
also has a discrete reflection ambiguity, but since it is not differential we do
not consider it. Perspective projection has a scale ambiguity such that if we
transform the shape and translation by a scale $s$:

$$s'_\alpha = s\,s_\alpha, \qquad t'_\kappa = s\,t_\kappa, \qquad (6)$$

we find that $\Pi_p[K_\kappa(R_\kappa s'_\alpha + t'_\kappa)] = \Pi_p[K_\kappa(R_\kappa s_\alpha + t_\kappa)]$.
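
The invariances in equations (4) to (6) are easy to confirm numerically. The following is a minimal Python sketch, assuming NumPy, a single feature point, a single image and $K_\kappa$ equal to the identity. The function names and all numerical values are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def project_orthographic(x):
    """Pi_o of equation (2): keep X and Y, drop the depth Z."""
    return x[:2]

def project_perspective(x):
    """Pi_p of equation (2): divide X and Y by the depth Z (K = identity)."""
    return x[:2] / x[2]

def change_coordinates(R_k, t_k, s_a, R, t):
    """Equation (4): returns (R'_k, t'_k, s'_a) for a coordinate change (R, t)."""
    return R_k @ R, R_k @ t + t_k, R.T @ (s_a - t)

def rotation(axis, angle):
    """Rotation matrix about a coordinate axis (0 = X, 1 = Y, 2 = Z)."""
    c, s = np.cos(angle), np.sin(angle)
    i, j = [(1, 2), (2, 0), (0, 1)][axis]
    R = np.eye(3)
    R[i, i], R[i, j], R[j, i], R[j, j] = c, -s, s, c
    return R

# Illustrative values (assumptions, not data from the paper).
R_k = rotation(0, 0.3) @ rotation(2, 0.5)
t_k = np.array([0.2, -0.1, 5.0])
s_a = np.array([0.4, 0.3, -0.2])
camera_point = R_k @ s_a + t_k

# Coordinate system indeterminacy, equation (4).
R_new, t_new, s_new = change_coordinates(
    R_k, t_k, s_a, rotation(1, -0.7), np.array([1.0, 2.0, -0.5]))
same_point = np.allclose(R_new @ s_new + t_new, camera_point)

# Depth indeterminacy under orthography, equation (5), with d = 3.
k_axis = np.array([0.0, 0.0, 1.0])
same_ortho = np.allclose(project_orthographic(camera_point + 3.0 * k_axis),
                         project_orthographic(camera_point))

# Scale indeterminacy under perspective, equation (6), with s = 2.5.
same_persp = np.allclose(project_perspective(R_k @ (2.5 * s_a) + 2.5 * t_k),
                         project_perspective(camera_point))

print(same_point, same_ortho, same_persp)
```

Each flag compares the image (or camera-frame) point before and after the transformation, so all three should agree whenever the point stays in front of the camera.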



2.4 Solution Manifold 

Since the camera equations contain these indeterminacies, then given the
measurement data, $p$, there is not a unique shape and motion parameter set, $\theta$, that
maps to this. Rather, equation (3) is satisfied by a manifold, $\mathcal{M}$, of valid solutions
within $\mathcal{T}$ which are all mapped to the same $p$. This manifold has dimension $r$,
given by the number of infinitesimal degrees of freedom at a given point. From
the ambiguity equations we obtain $r = 7$ for perspective projection and
$r = M + 6$ under orthography. Figure 1 illustrates a solution $\theta \in \mathcal{M}$.




Fig. 1. An illustration of a curve representing a manifold $\mathcal{M}$ of solution vectors $\theta$, all
lying in the parameter space $\mathcal{T}$. Choosing a gauge $C$, which intersects the manifold at
one point, defines a unique solution $\theta_C$.

2.5 Gauge Constraints

In order to remove the ambiguity of the solution we can define a gauge or
manifold of points, $C$. Let $C$ contain all those points in $\mathcal{T}$ that satisfy a set of $r$
constraint equations:

$$c_i(\theta) = 0 \quad \text{for } 1 \le i \le r. \qquad (7)$$

The gauge $C$ will thus have dimension $n - r$. We require that $C$ intersect $\mathcal{M}$
transversally and at most at one point per connected component of $\mathcal{M}$. The
intersection of $C$ and $\mathcal{M}$ thus provides a unique solution within a connected
component of $\mathcal{M}$, as illustrated in Figure 1. However there may be ambiguities
between components of $\mathcal{M}$, such as the reflection ambiguity in orthography.

For example, we could define an arbitrary gauge with the following constraints:

$$\sum_{\alpha=1}^{N} s_\alpha = 0, \qquad R_1 = I, \qquad \sum_{\alpha=1}^{N} s_\alpha^\top s_\alpha = 1. \qquad (8)$$

This fixes the origin of the object coordinate system in its centroid, aligns the
object coordinate system with the first image, and fixes the scale. In orthography
the scale constraint is omitted, but we add the constraint set $t_{z\kappa} = 0$ on the Z
component of translation.

We note that this, or any other choice of gauge, is arbitrary and does not affect
the geometry. It does affect our parameter estimates and their uncertainties, but
in ways that do not affect the geometric meaning of the results.

3 Uncertainty in Data Fitting

When there is noise in the measured data, there will be a resulting uncertainty
in the recovered parameters, which we would like to represent by a covariance
matrix. However, when indeterminacies exist, the solution will be a manifold
rather than a point, and standard perturbation analysis cannot be performed.
The usual approach in dealing with this is to choose a gauge and constrain
the solution to lie in this gauge. While this approach is valid, it introduces
additional constraints into the estimation process, and the resulting uncertainty
values are strongly dependent on the choice of gauge. In this section we ask the
question: How can we estimate the geometric uncertainty without depending on
an arbitrary selection of a gauge? To answer this we introduce gauge invariants
whose uncertainty does not depend on gauge choice. We also derive a Geometric
Equivalence Relationship that considers only this “true” geometric uncertainty.
Along the way we derive the normal form for the covariance, which gives us a
convenient way to calculate uncertainty without having to explicitly specify a
gauge.

3.1 Perturbation Analysis

First we derive an uncertainty measure in an arbitrary gauge. We assume that
the noise is small, and thus that the first order terms dominate. When the noise
is Gaussian the first order terms exactly describe the noise. The measured data,
$p$, is the result of the true feature positions, $\bar{p}$, corrupted by noise, $\Delta p$:

$$p = \bar{p} + \Delta p. \qquad (9)$$

The noise $\Delta p$ is a random variable of the most general type, not necessarily
independent for different points, but it is assumed to have zero mean and known
variance^1:

$$V_p = V[p] = E\{\Delta p\, \Delta p^\top\}. \qquad (10)$$

We note that in the special case when feature points are independent, $V[p]$ will
be block diagonal with the $2 \times 2$ block diagonal elements giving the independent
feature covariances.

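Given this uncertainty in the measured data, let $\hat\theta$ be our estimator of the
shape and motion parameters. There is no unique true solution unless we restrict
our estimation to a particular gauge. If we choose gauge $C$ our estimator can be
written as $\hat\theta_C = \bar\theta_C + \Delta\theta_C$, for true solution $\bar\theta_C$ and perturbation $\Delta\theta_C$. The
perturbation $\Delta\theta_C$ and its variance, $V[\Delta\theta_C]$, both lie in the tangent plane to the
gauge manifold, $T_{\bar\theta_C}[C]$.

We expand equation (3) around $\bar\theta_C$ and get to first order:

$$\nabla_\theta \Pi(\bar\theta_C)^\top \Delta\theta_C = \Delta p \qquad (11)$$

where $\nabla_\theta$ is the gradient with respect to $\theta$ in the manifold $\mathcal{T}$. We then split the
perturbations, $\Delta\theta_C$, into two components, those in $T_{\bar\theta_C}[\mathcal{M}]$ and those in
$T_{\bar\theta_C}[\mathcal{M}]^\perp$, as shown in Figure 2:

$$\Delta\theta_C = \Delta\theta_C^{\parallel\mathcal{M}} + \Delta\theta_C^{\perp\mathcal{M}}. \qquad (12)$$

The gradient $\nabla_\theta \Pi(\theta)$ is orthogonal to the tangent plane of $\mathcal{M}$ and has rank
$n - r$. We can thus solve for the orthogonal perturbations:

$$\Delta\theta_C^{\perp\mathcal{M}} = \left(\nabla_\theta \Pi(\bar\theta_C)^\top\right)^{-}_{n-r} \Delta p, \qquad (13)$$

where $(\cdot)^{-}_{n-r}$ denotes the Moore-Penrose generalized inverse^2 constrained to have
rank $n - r$. We call the covariance of this orthogonal component the normal
covariance:

$$V_{\perp\mathcal{M}}[\theta] = \left(\nabla_\theta \Pi(\theta)^\top\right)^{-}_{n-r} V_p \left(\nabla_\theta \Pi(\theta)^\top\right)^{-\top}_{n-r}. \qquad (14)$$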

The normal covariance is expressed at a particular solution, $\theta$, and depends on
our choice of parametrization and implicitly assumes a metric over parameter
space. But it does not require explicit gauge constraints (rather, it implicitly
assumes a gauge normal to the manifold), and as we shall see, it incorporates all
of the essential geometric uncertainty in the solution.

^1 We can extend this to the case when variance is known only up to a scale factor.
^2 The Moore-Penrose generalized inverse is defined such that if $A = U \Lambda V^\top$ by SVD,
then $A^{-}_{N} = V \Lambda^{-}_{N} U^\top$, where $\Lambda^{-}_{N}$ has the first $N$ singular values inverted on the
diagonal, and the rest zeroed.

When the indeterminacies are removed by adding constraints, the normal
covariance must be obliquely projected onto the appropriate constraint surface.
The uncertainty in the gauge will be in its tangent plane: $\Delta\theta_C \in T[C]$. We
already know the perturbation, $\Delta\theta_C^{\perp\mathcal{M}}$, orthogonal to $T[\mathcal{M}]$, and so it only
remains to derive the component parallel to $T[\mathcal{M}]$, as illustrated in Figure 2.

Fig. 2. An illustration of the oblique projection of perturbations along the
solution tangent space, $T[\mathcal{M}]$, and onto the gauge manifold tangent space $T[C]$:
$\Delta\theta_C = \Delta\theta_C^{\perp\mathcal{M}} + \Delta\theta_C^{\parallel\mathcal{M}}$. This projection transforms the normal covariance
matrix into the local gauge covariance.



Let $U$ be a matrix with $r$ columns spanning $T[\mathcal{M}]$ at $\bar\theta_C$, and let $V$ be a
matrix with $r$ columns spanning the space orthogonal to $T[C]$ at $\bar\theta_C$. Then we
can express equation (12) as:

$$\Delta\theta_C = \Delta\theta_C^{\perp\mathcal{M}} + Ux \qquad (15)$$

for some unknown coefficients $x$. The fact that this perturbation is in the
constraint tangent plane implies that $V^\top \Delta\theta_C = 0$. Applying this to (15) and
eliminating $x$ we obtain:

$$\Delta\theta_C = Q_C\, \Delta\theta_C^{\perp\mathcal{M}} \qquad (16)$$

where $Q_C = I - U(V^\top U)^{-1} V^\top$ is our oblique projection operator along $T[\mathcal{M}]$.
The covariance of $\theta$ in this gauge is then given by:

$$V_C[\theta_C] = Q_C\, V_{\perp\mathcal{M}}[\theta_C]\, Q_C^\top. \qquad (17)$$
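
The oblique projection of equations (15) to (17) can be sketched in a few lines of Python, assuming NumPy. The names `oblique_projector` and `gauge_covariance`, the dimensions and the random matrices below are illustrative assumptions; the synthetic `normal_cov` merely stands in for a normal covariance by having no component along the columns of `U`.

```python
import numpy as np

def oblique_projector(U, V):
    """Q = I - U (V^T U)^{-1} V^T from equation (16).

    U: n x r basis of the tangent space of the solution manifold.
    V: n x r basis of the space orthogonal to the gauge tangent plane.
    """
    n = U.shape[0]
    return np.eye(n) - U @ np.linalg.solve(V.T @ U, V.T)

def gauge_covariance(normal_cov, U, V):
    """Equation (17): project the normal covariance into the gauge."""
    Q = oblique_projector(U, V)
    return Q @ normal_cov @ Q.T

# Synthetic example (random values, purely illustrative).
rng = np.random.default_rng(0)
n, r = 6, 2
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))

# Symmetric positive semi-definite matrix with no component along span(U),
# standing in for a normal covariance.
P_perp = np.eye(n) - U @ np.linalg.solve(U.T @ U, U.T)
B = rng.standard_normal((n, n))
normal_cov = P_perp @ B @ B.T @ P_perp

Q = oblique_projector(U, V)
cov_C = gauge_covariance(normal_cov, U, V)

# A direction orthogonal to the solution manifold, as for an invariant gradient.
g = P_perp @ np.arange(1.0, n + 1.0)
print(g @ normal_cov @ g, g @ cov_C @ g)
```

Because $Q_C^\top g = g$ for any $g$ orthogonal to the columns of $U$, the two printed quadratic forms should agree up to rounding, even though `normal_cov` and `cov_C` differ entry by entry.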

3.2 Inherent Geometric Uncertainty

The camera equations provide geometric constraints on the measurements. Pa- 
rameters containing indeterminacies correspond to entities not fully constrained 
by the camera equations, whereas parameters which have a unique value over 
the solution manifold are fully constrained. These fully constrained parameters 
describe the “true” geometric entities. They can be uniquely recovered, (up to 
possibly a discrete ambiguity), from the camera equations. Having a unique 
value on the solution manifold means that the parameter is invariant to gauge 
transformations on the solution. We call these gauge invariants. 

Not only are the values of gauge invariants unique, but given the covariance 
of the measured data, the covariance of the invariant is uniquely obtainable. 
However, the covariance of parameters containing indeterminacies will not be 
uniquely specified and many possible “geometrically equivalent” covariances can 
be obtained that correspond to the same measurement covariance. In this section 
we derive a Geometric Equivalence Relationship for parameters that contain in- 
determinacies. This permits us to test whether covariances of these parameters 
under different gauges correspond to the same underlying measurement covari- 
ance or not. Finally we propose a fast method for covariance estimation and 
show its correctness using the Geometric Equivalence Relationship. 

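Let us assume that we are measuring an invariant property, $I(\theta)$, of the
solution. Consider the estimators in two gauges, $\hat\theta_C$ and $\hat\theta_{C'}$, with uncertainties
$\Delta\theta_C$ and $\Delta\theta_{C'}$ in their corresponding tangent planes. Let $\partial\theta_C / \partial\theta_{C'}$ be the
Jacobian matrix that maps perturbations in the tangent plane of $C'$ to perturbations
in the tangent plane of $C$:

$$\Delta\theta_C = \frac{\partial\theta_C}{\partial\theta_{C'}} \Delta\theta_{C'}. \qquad (18)$$

The invariant property will have the same value for both solutions: $I(\theta_C) = I(\theta_{C'})$.
Moreover, since $I$ is invariant to all points in $\mathcal{M}$, it must also be invariant
to infinitesimal perturbations in $\mathcal{M}$, and hence its gradient must be orthogonal
to the tangent plane of $\mathcal{M}$:

$$\nabla_\theta I \in T[\mathcal{M}]^\perp. \qquad (19)$$

A perturbation of the invariant at $\theta_C$ can be written:

$$\Delta I(\theta_C) = \nabla_\theta I^\top \Delta\theta_C = \nabla_\theta I^\top \frac{\partial\theta_C}{\partial\theta_{C'}} \Delta\theta_{C'}. \qquad (20)$$

The variance of the invariant can be calculated using both components of this
equation:

$$V[I] = \nabla_\theta I^\top V[\theta_C] \nabla_\theta I = \nabla_\theta I^\top \frac{\partial\theta_C}{\partial\theta_{C'}} V[\theta_{C'}] \left(\frac{\partial\theta_C}{\partial\theta_{C'}}\right)^{\!\top} \nabla_\theta I. \qquad (21)$$

The covariances of parameters with indeterminacies may have “non-geometric”
components along the tangent plane of the solution manifold. This equation
transforms these covariances into the uniquely defined covariance of a gauge
invariant.

We then apply the orthogonal constraint from equation (19) to both expressions
for $V[I]$ and obtain the following result:

$$u^\top \left( V[\theta_C] - \frac{\partial\theta_C}{\partial\theta_{C'}} V[\theta_{C'}] \left(\frac{\partial\theta_C}{\partial\theta_{C'}}\right)^{\!\top} \right) u = 0, \quad \forall u \in T_{\theta_C}[\mathcal{M}]^\perp. \qquad (22)$$

This means that the difference between the covariance and the transformed
covariance, $V[\theta_C] - \frac{\partial\theta_C}{\partial\theta_{C'}} V[\theta_{C'}] \left(\frac{\partial\theta_C}{\partial\theta_{C'}}\right)^{\!\top}$, must lie in the tangent space $T_{\theta_C}[\mathcal{M}]$. Or
equivalently we can say that these two variances have the same orthogonal
component to $T[\mathcal{M}]$ at $\theta_C$. We denote this relationship as $V[\theta_C] \equiv V[\theta_{C'}] \ \mathrm{mod}\ \mathcal{M}$.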
Thus we have: 

Geometric Equivalence Relationship. The covariance matrices $V[\theta_C]$ and
$V[\theta_{C'}]$ are geometrically equivalent if and only if

$$V[\theta_C] \equiv V[\theta_{C'}] \ \mathrm{mod}\ \mathcal{M}. \qquad (23)$$

In essence this says that at a point $\theta \in \mathcal{M}$, it is only the component of the
covariance that is not in the tangent plane that contributes to the geometric
uncertainty. Any matrix satisfying this equivalence relationship captures the
geometric uncertainty of the parameters. The normal form of the covariance
calculated from equation (14) is a natural choice that captures this uncertainty
for a given parametrization, and does not require constraints to be specified.
From this relationship we see that the covariance in any gauge is equivalent
to the normal covariance, i.e. $V_C[\theta_C] \equiv V_{\perp\mathcal{M}}[\theta] \ \mathrm{mod}\ \mathcal{M}$. Thus the covariance
of an invariant can be calculated directly from either of these covariances by
transforming them with the invariant gradient, $\nabla_\theta I$, as in equation (21).
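
As an illustration, the equivalence test of equations (22) and (23) can be written as a small numerical check. The following Python sketch assumes NumPy and symmetric covariance matrices; the function names, the tolerance and the `jacobian` argument (defaulting to the identity, i.e. both covariances expressed in the same parametrization) are hypothetical choices rather than anything specified in the paper.

```python
import numpy as np

def normal_space_basis(U):
    """Orthonormal basis of the orthogonal complement of span(U).

    U is n x r with full column rank and spans the tangent space of M.
    """
    r = U.shape[1]
    left, _, _ = np.linalg.svd(U, full_matrices=True)
    return left[:, r:]

def geometrically_equivalent(cov_a, cov_b, U, jacobian=None, tol=1e-9):
    """Check equation (22) for every u orthogonal to the tangent space span(U).

    For symmetric matrices, u^T D u = 0 for all such u is the same as
    N^T D N = 0, where the columns of N span the orthogonal complement.
    """
    if jacobian is not None:
        cov_b = jacobian @ cov_b @ jacobian.T
    N = normal_space_basis(U)
    return bool(np.max(np.abs(N.T @ (cov_a - cov_b) @ N)) < tol)

# Synthetic example (random values, purely illustrative).
rng = np.random.default_rng(1)
n, r = 5, 2
U = rng.standard_normal((n, r))
B = rng.standard_normal((n, n))
cov_a = B @ B.T

# Adding terms that involve the tangent directions leaves the geometry unchanged.
E = rng.standard_normal((n, r))
cov_b = cov_a + U @ E.T + E @ U.T

print(geometrically_equivalent(cov_a, cov_b, U))
print(geometrically_equivalent(cov_a, cov_a + np.eye(n), U))
```

The tolerance is absolute, so in practice it would need to be scaled to the magnitude of the covariances being compared.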

4 Maximum Likelihood Estimation 

It is known that Maximum Likelihood (ML) estimation is unbiased and obtains
the optimal shape and motion parameters. The ML solution is obtained by
minimizing the cost:

$$J = (p - \Pi(\theta))^\top V_p^{-1} (p - \Pi(\theta)), \qquad (24)$$

where $\theta \in \mathcal{T}$. The minimum value of this will have the same camera
indeterminacies described in Section 2.3, and hence determine a manifold, $\mathcal{M}$, of
geometrically equivalent solutions. A unique solution can be obtained by choosing an
arbitrary gauge $C$.

4.1 Free-Gauge Solution

Instead of constraining our minimization process with our chosen gauge $C$, at
each step we would like to choose a gauge orthogonal to the solution manifold
$\mathcal{M}$, and proceed in that direction. We expect this to give better convergence to
the manifold $\mathcal{M}$, especially when our desired gauge $C$ has a large oblique angle



Uncertainty Modeling for Optimal Structure from Motion 209 



to $\mathcal{M}$. Once any point on $\mathcal{M}$ is achieved, it is easy to transform this solution
into any desired gauge.
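
One way to carry out such a transformation, for the example gauge of equation (8) under perspective projection, is to apply equation (4) twice and then equation (6). The Python sketch below assumes NumPy, $K_\kappa$ equal to the identity, and array layouts chosen for illustration (`R` of shape (M, 3, 3), `t` of shape (M, 3), `s` of shape (N, 3)); the function names are hypothetical.

```python
import numpy as np

def project_perspective(R, t, s):
    """Image points of all features in all images, with K = identity.

    Returns an (M, N, 2) array holding (X/Z, Y/Z) of R[k] @ s[a] + t[k].
    """
    cam = np.einsum('kij,aj->kai', R, s) + t[:, None, :]
    return cam[..., :2] / cam[..., 2:3]

def to_gauge(R, t, s):
    """Move a perspective solution into the gauge of equation (8)."""
    # Put the object origin at the centroid: equation (4) with R = I, t = centroid.
    c = s.mean(axis=0)
    s = s - c
    t = t + R @ c
    # Align the object frame with the first image: equation (4) with R = R_1^T, t = 0.
    R1 = R[0]
    s = s @ R1.T          # each row becomes R_1 @ s_a
    R = R @ R1.T          # each R_k becomes R_k @ R_1^T
    # Fix the scale so that the squared norms of the shape points sum to one: equation (6).
    scale = 1.0 / np.sqrt(np.sum(s ** 2))
    return R, scale * t, scale * s
```

A usage example would compare `project_perspective` before and after calling `to_gauge`; the image points should be unchanged, while the returned shape has zero centroid, unit scale, and an identity first rotation.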

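Levenberg-Marquardt (LM) minimization is a combination of Gauss-Newton
and gradient descent. The gradient of $J$ is obtained as

$$\nabla_\theta J = -2 \nabla_\theta \Pi(\theta) V_p^{-1} (p - \Pi(\theta)), \qquad (25)$$

and the Gauss-Newton approximation for the Hessian is

$$\nabla^2_\theta J \approx \tfrac{1}{2} E\{\nabla_\theta J\, \nabla_\theta J^\top\} = 2 \nabla_\theta \Pi(\theta) V_p^{-1} \nabla_\theta \Pi(\theta)^\top. \qquad (26)$$

Gauss-Newton proceeds iteratively by solving the linear equation:

$$\nabla^2_\theta J\, \Delta\theta = -\nabla_\theta J. \qquad (27)$$

However, in our case the Hessian, $\nabla^2_\theta J$, is singular with rank $n - r$ due to the
ambiguity directions. Hence we take steps in the direction:

$$\Delta\theta = -(\nabla^2_\theta J)^{-}_{n-r} \nabla_\theta J, \qquad (28)$$

which proceeds orthogonally towards the manifold $\mathcal{M}$. This is called free-gauge
minimization. To implement LM we add a gradient term.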
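
The free-gauge step of equation (28) can be illustrated on a much smaller problem than SFM. The Python sketch below (NumPy assumed) uses a toy bilinear model, $p_{ij} = a_i b_j$, chosen here only because it has a single gauge freedom ($a \to \lambda a$, $b \to b/\lambda$, so $r = 1$); the model, the function names, the starting point and the choice $V_p = I$ are all assumptions made for illustration, and no LM damping term is included.

```python
import numpy as np

def pinv_rank(A, rank):
    """Generalized inverse keeping only the `rank` largest singular values."""
    U, sv, Vt = np.linalg.svd(A)
    inv = np.zeros_like(sv)
    inv[:rank] = 1.0 / sv[:rank]
    return (Vt.T * inv) @ U.T

def predict(theta, m):
    """Toy measurement model Pi(theta): p_ij = a_i * b_j, flattened row by row."""
    a, b = theta[:m], theta[m:]
    return np.outer(a, b).ravel()

def jacobian(theta, m):
    """Jacobian of predict, i.e. the transpose of the gradient used in the text."""
    a, b = theta[:m], theta[m:]
    k = b.size
    J = np.zeros((m * k, m + k))
    for i in range(m):
        for j in range(k):
            J[i * k + j, i] = b[j]        # d(a_i b_j) / d a_i
            J[i * k + j, m + j] = a[i]    # d(a_i b_j) / d b_j
    return J

def free_gauge_gauss_newton(theta, p, m, r=1, iterations=20):
    """Iterate equation (28) with V_p = I; r is the number of gauge freedoms."""
    n = theta.size
    for _ in range(iterations):
        Jac = jacobian(theta, m)
        grad = -2.0 * Jac.T @ (p - predict(theta, m))    # equation (25)
        hess = 2.0 * Jac.T @ Jac                         # equation (26)
        theta = theta - pinv_rank(hess, n - r) @ grad    # equation (28)
    return theta

# Noise-free toy data and a nearby starting point (illustrative values).
a_true = np.array([1.0, 2.0])
b_true = np.array([1.0, -1.0, 0.5])
p = np.outer(a_true, b_true).ravel()
theta0 = np.array([1.05, 1.95, 0.95, -1.05, 0.55])
theta_hat = free_gauge_gauss_newton(theta0, p, m=2)
```

The recovered `theta_hat` need not equal the true parameters, since no gauge was imposed; it only needs to reproduce the measurements, and it can be moved into any gauge afterwards.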

At the solution, $\theta \in \mathcal{M}$, the covariance of the ML estimation of shape and
motion parameters is obtained as:

$$V[\theta] = E\{\Delta\theta\, \Delta\theta^\top\} = 2 (\nabla^2_\theta J)^{-}_{n-r} \qquad (29)$$

$$= \left(\nabla_\theta \Pi(\theta) V_p^{-1} \nabla_\theta \Pi(\theta)^\top\right)^{-}_{n-r}. \qquad (30)$$

It can be shown that this is identical to the normal covariance expression in
equation (14), $V[\theta] = V_{\perp\mathcal{M}}[\theta]$, and not just up to a geometric equivalence, and
so we use this as an alternative expression for the normal covariance.

4.2 Efficient Covariance Estimation 

The calculation of the generalized inverse in equations (14) and (30) involves
the use of SVD, which takes $O(n^3)$ operations, and so is slow for many feature
points or images. The Hessian often has sparse structure, and when it is multiplied
by the gradient, as in LM, the generalized inverse can be avoided; efficient
minimization methods for $J$ have been proposed. Here, however, we not
only want a fast LM method, but also an efficient method to estimate the full
covariance. We propose an efficient method in this section.

Let us assume that our parameter vector is divided into a shape and a motion
part, $\theta_s$ and $\theta_m$ respectively, such that $\theta = (\theta_s^\top, \theta_m^\top)^\top$. The Hessian is then split
into its shape and motion components:

$$\nabla^2_\theta J = \begin{pmatrix} \nabla^2_{\theta_s} J & \nabla_{\theta_s \theta_m} J \\ \nabla_{\theta_m \theta_s} J & \nabla^2_{\theta_m} J \end{pmatrix} = \begin{pmatrix} U & W \\ W^\top & V \end{pmatrix}. \qquad (31)$$

When the noise in the feature points specified by $V_p$ is independent between
points, $U$ and $V$ are full rank^3 and sparse with $O(N)$ and $O(M)$ non-zero elements
respectively, where $N$ is the number of features and $M$ is the number of images [2].
The cross-term matrix $W$ is not sparse, however, and so applying standard
sparse techniques will not reduce the complexity of determining the generalized
inverse.

First we define the full rank matrix $T$ as follows:

$$T = \begin{pmatrix} I & 0 \\ -W^\top U^{-1} & I \end{pmatrix} \qquad (32)$$

and obtain the block diagonal matrix:

$$T\, \nabla^2_\theta J\, T^\top = \begin{pmatrix} U & 0 \\ 0 & V - W^\top U^{-1} W \end{pmatrix}. \qquad (33)$$

Then we define the covariance $V_T[\theta]$ by:

$$V_T[\theta] = 2\, T^\top (T\, \nabla^2_\theta J\, T^\top)^{-}_{n-r}\, T = 2\, T^\top \begin{pmatrix} U^{-1} & 0 \\ 0 & (V - W^\top U^{-1} W)^{-}_{m-r} \end{pmatrix} T, \qquad (34)$$

where $m = 6M$ is the number of motion parameters. This can be obtained in
$O(N^2 M + M^3)$ operations which, when the number of images is small (i.e.
$M \ll N$), is much faster than the original SVD, which is $O(N^3 + M^3)$.

In order for $V_T[\theta]$ to be a valid description of the uncertainty, we must show
that it is geometrically equivalent to $V_{\perp\mathcal{M}}[\theta]$. Let $H = \frac{1}{2}\nabla^2_\theta J$ be half the Hessian,
and consider the equation:

$$Hx = u \qquad (35)$$

where $u$ is in the column space of $H$. The general solution is a combination
of a unique particular solution, $x_p = H^{-}u$, in the column space of $H$, and a
homogeneous solution, $x_h$, which is any vector in the nullspace of $H$, i.e. $Hx_h = 0$.
We left multiply equation (35) by $T$ and rearrange to obtain:

$$(THT^\top)\, T^{-\top} x = Tu. \qquad (36)$$

Then changing variables, $y = T^{-\top}x$, and solving for $y$ we obtain
$y = (THT^\top)^{-} Tu + y_h$, where $(THT^\top) y_h = 0$. Now transforming back to $x$
we can decompose the solution into the particular and homogeneous parts:

$$x = T^\top (THT^\top)^{-} Tu + T^\top y_h = x_p + x_h, \qquad (37)$$

where $x_p = H^{-}u$ is the particular solution obtained in equation (35). It is easy
to see that $T^\top y_h$ is in the nullspace of $H$, and hence $T^\top (THT^\top)^{-} Tu = x_p + x'_h$
for some vector $x'_h$ in the nullspace of $H$.

^3 $U$ is full rank for affine and perspective projection, but not when homogeneous
coordinates are used as in the general projective case, but then we do not obtain
Euclidean shape.

We now apply the geometric equivalence test to $V_{\perp\mathcal{M}}[\theta] = H^{-}$ and
$V_T[\theta] = T^\top (THT^\top)^{-} T$. The change of constraint Jacobian is the identity,
$\partial\theta_C / \partial\theta_{C'} = I$, and the orthogonal component to the tangent space of $\mathcal{M}$,
$T[\mathcal{M}]^\perp$, is spanned by the columns of $H$, and so $u$ is any vector in the column
space of $H$. Applying the equivalence relationship we obtain:

$$u^\top (H^{-} - T^\top (THT^\top)^{-} T)\, u = u^\top (x_p - x_p - x'_h) = u^\top (-x'_h) = 0, \qquad (38)$$

for all $u$ in the column and row space of $H$, since $x'_h$ is in the nullspace. We thus
conclude that $V_T[\theta]$ can be efficiently estimated and is geometrically equivalent
to the normal covariance $V_{\perp\mathcal{M}}[\theta]$.
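
The argument above can be checked numerically on a synthetic half-Hessian. The Python sketch below assumes NumPy; the function names, block sizes and random test matrix are illustrative assumptions. For clarity it uses dense linear algebra throughout, so it demonstrates the equivalence in equation (38) but not the speed advantage, which would additionally require exploiting the block-sparse structure of $U$.

```python
import numpy as np

def pinv_rank(A, rank):
    """Generalized inverse keeping only the `rank` largest singular values."""
    Us, sv, Vt = np.linalg.svd(A)
    inv = np.zeros_like(sv)
    inv[:rank] = 1.0 / sv[:rank]
    return (Vt.T * inv) @ Us.T

def efficient_covariance(H, n_shape, r):
    """T^T (T H T^T)^- T by block elimination, for symmetric H = [[U, W], [W^T, V]]."""
    U = H[:n_shape, :n_shape]
    W = H[:n_shape, n_shape:]
    V = H[n_shape:, n_shape:]
    m = V.shape[0]
    UinvW = np.linalg.solve(U, W)              # U^{-1} W
    S = V - W.T @ UinvW                        # reduced motion block, rank m - r
    T = np.eye(H.shape[0])
    T[n_shape:, :n_shape] = -UinvW.T           # -W^T U^{-1}, using the symmetry of U
    D = np.zeros_like(H)
    D[:n_shape, :n_shape] = np.linalg.inv(U)
    D[n_shape:, n_shape:] = pinv_rank(S, m - r)
    return T.T @ D @ T

# Synthetic half-Hessian with an r-dimensional nullspace (illustrative values).
rng = np.random.default_rng(3)
n_shape, m, r = 9, 6, 2
n = n_shape + m
Z, _ = np.linalg.qr(rng.standard_normal((n, r)))    # orthonormal gauge directions
A = rng.standard_normal((30, n)) @ (np.eye(n) - Z @ Z.T)
H = A.T @ A

normal_cov = pinv_rank(H, n - r)                    # Moore-Penrose form
fast_cov = efficient_covariance(H, n_shape, r)      # block-elimination form

# Columns of N span the column space of H, i.e. the complement of the gauge directions.
N = np.linalg.svd(Z, full_matrices=True)[0][:, r:]
gap = np.max(np.abs(N.T @ (normal_cov - fast_cov) @ N))
print(gap)
```

The printed gap is the largest entry of the difference between the two covariances once both are restricted to the directions orthogonal to the solution manifold; it should be at rounding level.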




Fig. 3. Four images of an eleven image sequence with significant noise added and the 
scaled standard deviation of each point illustrated with an ellipse. (The lines connecting 
points are only present for viewing). The synthetic object is shown bottom left. The 
optimal reconstruction, given the noise estimates, is shown on the right with uncertainty 
ellipsoids. These ellipsoids, corresponding to the 3x3 block diagonal elements of a full 
shape covariance, are significantly correlated as shown in the full covariance matrix in 
Figure 13 



5 Results 

We give some sample synthetic and real results illustrating our uncertainty
modeling. A set of features in an image sequence with known correspondences is
shown in Figure 3. The synthetic object is also shown along with a sample
optimal reconstruction and ellipsoids illustrating feature-based uncertainty. The
individual feature uncertainties are strongly correlated, as illustrated in the
subsequent Figures.

The normal covariance for this shape and motion recovery is shown in
Figure 4. This contains a full description of the uncertainty in the features, but
to experimentally confirm it using Monte Carlo simulation requires that we
select a gauge such as that in equation (8). In Figure 5 we show the predicted
covariance obtained by projecting the normal covariance into this gauge using
equation (17). Even though the normal covariance and the predicted covariance
have significantly different values and correlations, they contain the same
geometric uncertainty (as they are equivalent under the Geometric Equivalence
Relationship) and will give the same predictions for uncertainties of gauge
invariants. Figure 5 also contains the Monte Carlo covariance estimate in this
gauge, involving 400 SFM reconstruction runs. It is very similar to the predicted
covariance, confirming that our uncertainty model is correct. An easier way to
visually compare the covariances is to plot the square root of their diagonal
elements. This gives the net standard deviation in each parameter in this gauge, as
illustrated in Figure 6.




Fig. 4. The predicted normal covariance matrix giving us the geometric uncertainty of 
the reconstructed synthetic object. The scaled absolute value is shown by the darkness 
of the shading. Here weak perspective was used and /r is the recovered scale for each 
image. We note that it can be altered by adding components in the tangent plane to 
M without changing the underlying uncertainty, as we see in Figure 5.

The problem with the shape and motion covariance plots is their dependence
on choice of gauge. Gauge invariants, however, will give us unambiguous
measures for the uncertainty of the results. We chose two invariants on our synthetic
object: an angle between two lines and the ratio of two lengths. Their
statistics are shown in Table 1, confirming very good matching between predicted
uncertainty and actual uncertainty.
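
The predicted uncertainty of such an invariant follows from first-order propagation of the parameter covariance through the invariant gradient. The Python sketch below (NumPy assumed) does this for an angle defined by three 3D points, using a numerical gradient; the function names, the three points and the covariance are made-up illustrative values and are unrelated to the experiments reported here.

```python
import numpy as np

def angle_deg(theta):
    """Angle at point 0 between the lines to points 1 and 2.

    theta stacks the coordinates of three 3D points into a 9-vector.
    """
    s = theta.reshape(3, 3)
    u, v = s[1] - s[0], s[2] - s[0]
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def numerical_gradient(f, theta, h=1e-6):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return g

def invariant_std(f, theta, cov):
    """First-order standard deviation of the invariant f given the covariance of theta."""
    g = numerical_gradient(f, theta)
    return np.sqrt(g @ cov @ g)

# Three points forming a right angle (illustrative values).
theta = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
g = numerical_gradient(angle_deg, theta)

# Gauge directions of a similarity transformation acting on the shape.
translate_x = np.tile([1.0, 0.0, 0.0], 3)
scale_dir = theta.copy()                             # scaling about the origin
rotate_z = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0, 0.0])

# A made-up covariance, and the same covariance with a component added along a gauge direction.
rng = np.random.default_rng(4)
B = rng.standard_normal((9, 9))
cov = 1e-4 * B @ B.T
cov_shifted = cov + np.outer(scale_dir, scale_dir)

print(invariant_std(angle_deg, theta, cov), invariant_std(angle_deg, theta, cov_shifted))
```

The gradient `g` is orthogonal to each gauge direction, as required of an invariant, so the component added along `scale_dir` should leave the predicted standard deviation of the angle unchanged.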

Next we show results for a real image sequence of a chapel in Figure 7, along
with the reconstructed shape from SFM. The feature correspondences were
determined manually. Not only can we obtain a texture-mapped reconstruction, we
can also obtain measurements of similarity invariant properties such as angles
with their uncertainties. We found the angle and its uncertainty (in standard
deviations) between two walls separated by a buttress: 117° ± 3.2°, as well as
two other angles on the chapel: 46.2° ± 2.1° and 93.2° ± 2.6°, as described in the
Figure caption. These quantities are exact and not only up to an unknown
transformation. We believe reporting this uncertainty measure is essential for most
quantitative analyses of the shape, and can only be done for gauge invariant
properties.

Fig. 5. (Left diagram) The predicted covariance in an arbitrary gauge, see equation
(8). We note that the values and correlations are significantly different from the normal
covariance in Figure 4, and yet it still contains the same geometric uncertainty. The
Monte Carlo estimation of covariance in this gauge is shown on the right. It shows close
similarity to the predicted covariance in this gauge, as can also be seen in Figure 6.

Fig. 6. The square roots of the diagonal elements of the covariances in Figure 5 are
shown here. This gives the net standard deviation of each parameter in the
experimental gauge (8), obtained from the diagonal of the covariance. The solid line is the
experimentally measured value and the dashed line is our prediction from the projected
normal covariance.

Table 1. Predicted and measured values, along with their uncertainties in standard
deviations, of two gauge independent properties of the synthetic object in Figure 3:
(left) the angle between two lines, and (right) the ratio of two lengths.

             Angle                      Ratio
             Mean      Uncertainty      Mean      Uncertainty
Predicted:   90.11°    ±2.10°           0.9990    ±0.0332
Recovered:   90.02°    ±2.10°           1.0005    ±0.0345

Fig. 7. An image from a 6 image sequence of a chapel is shown on the left with features
registered by hand. The reconstruction is on the right. We can obtain quantitative
measures and uncertainties from this reconstruction. In this case we estimated the
angle between two walls separated by a buttress, and two other angles as illustrated on
the far right. The values (anti-clockwise from the top) are: 117° ± 3.2°, 46.2° ± 2.1°
and 93.2° ± 2.6°.

6 Concluding Remarks

We have addressed the question of uncertainty representation when parameter
indeterminacies exist in estimation problems. The shape and motion parameters
estimated by SFM contain inherent indeterminacies. Hence, to apply perturbation
analysis, these parameters are first constrained by a gauge and the
covariance is estimated in this gauge. Unfortunately the choice of gauge will
have significant effects on the uncertainties of the parameters, as illustrated in
our results. These effects, however, are “non-physical” and do not correspond
to changes in the actual geometric uncertainty, which is unaffected by an
arbitrary choice of gauge. Thus shape and motion parameter uncertainties contain
artifacts of the choice of gauge. The uncertainties of gauge invariant parameters,
however, are not affected by these indeterminacies and hence correspond directly
to the inherent geometric uncertainty. They thus provide unambiguous measures
for the solution uncertainty.

We derived a Geometric Uncertainty Relationship which permits us to com- 
pare the geometric uncertainty contained in covariances described under different 
parametrizations and gauges. Using this relationship we showed that the nor- 
mal covariance, whose estimation does not need explicit gauge constraints, fully 
describes the solution uncertainty. We were also able to derive an efficient
estimation method for the solution covariance. Using the Geometric Uncertainty
Relationship, we showed that this estimate also fully captures the solution un- 
certainty. Gauge invariant uncertainties can be calculated by transforming this 
covariance. 
Discussion

Bill Triggs: When you use your efficient reduction technique, the reduction is
triangular, not orthogonal, so you should get a different covariance out at the
end; it will be gauge-equivalent, but a generalized inverse rather than a
Moore-Penrose pseudo-inverse. Can you comment on how large the difference seems
to be in practice, and whether looking at one of these covariance matrices would
give you a misleading idea of the uncertainties in the other one?

Daniel Morris: That’s a good question. There are two issues to consider: the 
size of the diagonal elements of the covariance matrix and the size of the off- 
diagonal elements indicating correlation between parameters. The normal co- 
variance will have the smallest trace of all gauge equivalent covariances. For the 
synthetic example we presented, the trace of our efficient covariance estimation 
method was 13% larger than that of the normal covariance, and the trace of 
the covariance of the standard gauge, in which the centroid was fixed, was 30% 
larger than the normal covariance. However, the normal covariance need not 
have small off-diagonal elements, and actually in our example it had 50% larger 
correlation values between shape and motion parameters than our efficient co- 
variance estimate. So numerically our method actually reduced the effect of cross 
terms. Also, while the magnitudes of the elements in the covariances varied
significantly between gauges, the overall structure of the covariances did not
seem to change significantly when different gauges were used.

Rick Szeliski: This is a more open-ended, speculative question. When you look, 
for example, at things like the individual variances of points, you sometimes fail 
to capture the global things. I guess there are the absolute gauge freedoms that 
exist in these problems, for example the coordinate or centroid freedom, and 
then there are other softer ambiguities, depending on the imaging, the bas-relief 
ambiguity, there are probably others too. I’m just wondering if there is some 
way we can gain intuition, perhaps through visualization, or some other way to 
get a handle qualitatively on what the error is in a given reconstruction. 

Daniel Morris: The question is, can we gain qualitative information from these 
covariances? 

Rick Szeliski: Is there, for example, some way to pull out that some of these
uncertainties are very highly correlated, to be able to just look at the data set
and say ... I can tell you very well where it is going to be, just like this ...
For example, where you had the angle between two faces, the one that had most to
do with how flat the scene was, was the one with the largest error.

Daniel Morris: It is interesting to speculate as to what kind of qualitative 
information we can obtain from looking at the covariance matrix. In our example 
we see large correlation effects between the rotation parameters and the Z- 
component of shape, and I think this corresponds to the bas-relief effect. So 
you can look at that, and there may be a way of quantifying it, for example, 
by looking at the eigenvalues. I think that’s what you do in your paper on 
ambiguities to determine how correlated the variables are. It is a bit harder, 
however, to gain qualitative insight into the uncertainty of invariant properties
such as angles from the parameter covariance matrix. To do that it is probably 
best to directly calculate the covariance matrix of these invariants. 
Abstract. This paper focuses on error characterization of the factorization
approach to shape and motion recovery from image sequences, using results
from matrix perturbation theory and covariance propagation for linear
models. Given the 2-D projections of a set of points across multiple image
frames and small perturbations on the image coordinates, first order
perturbation and covariance matrices for 3-D affine/Euclidean shape and
motion are derived and validated against the ground truth. The propagation
of the small perturbation and covariance matrix provides a better
understanding of the factorization approach and its results, and provides
error sensitivity information for 3-D affine/Euclidean shape and motion
subject to small image error. Experimental results are presented to support
the analysis and show how the error analysis and error measures can be used.



1 Introduction 

The factorization approach to 3-D shape and motion recovery from image se- 
quence, first proposed in reconstructs 3-D shape and motion in affine and 
Euclidean spaces given the 2-D projections of a set of points across multiple im- 
age frames captured by uncalibrated affine cameras 0 . It employs the facts that 
the ideal measurement matrix has rank 3 and can be decomposed as 3-D shape 
and rotation after canceling the translation terms by singular value decomposi- 
tion, and redundant information is good for robust estimation. The introduction 
of the factorization approach for orthographic projection PP was followed by a 
series of extensions to more general camera models, from weak perspective |3], 
para-perspective 0, affine 0 to projective projection |^, and methods based on 
sequential computation Pj, line correspondence jS| and occlusion and uncertainty 
handling [3|. 

* This work is supported by Siemens Corporate Research, Princeton, NJ 08540 



B. Triggs, A. Zisserman, R. Szeliski (Eds.): Vision Algorithms’99, LNCS 1883, pp. 218-|2^^ 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 



Error Characterization of the Factorization Approach 219 



The factorization approach has attracted significant interest from various
researchers who have applied it to a variety of tasks. It is therefore of great
interest to investigate the sensitivity of the estimated 3-D shape and motion
measurements to perturbations on image measurements. However, to the best of
the authors' knowledge, no thorough error sensitivity analysis specific to this
method has been attempted, although quite a few papers address error analysis
for structure from motion algorithms [10, 11, 12, 13, 9].

One related work is reported in where a general framework of error anal- 
ysis for structure from motion is proposed and the singular values of the derived 
Jacobian are used for error interpretation. It is argued in m that sensitivity is 
a property of the problem and not of an implemented technique for the problem. 
Moreover, they point out that absolute statements about errors in the output are 
important to understand whether structure from motion is feasible and should 
be pursued. It is also pointed out that the Jacobian for the problem is singu- 
lar due to the nature of the inverse problem and hence the eigenvectors of the 
Jacobian are used to interpret the sensitivity of the results. Our approach falls 
within the general framework in H2]. However our analysis is specific for the 
factorization approach. Both methods are only applicable to small perturbation 
analysis where linearization holds. In addition, if the SfM component is one
component in a larger vision system, it is important to analyze the sensitivity
of the implementation and derive conclusions about the specific technique used
in the system. Thus, an analysis of a sub-class of techniques that perform SfM
is necessary. We will show that the perturbation theory on eigensystem, proposed
and used for error estimation and analysis for 3-D structure from two perspec- 
tive views |1 on 1 | . is also applicable to the factorization approach and leads to 
analytical expressions for the first order shape and motion error measures. 

Our analysis uses step-by-step error propagation through the stages of the
factorization approach to characterize the errors at every stage of the process.
The first stage is the singular value decomposition of the observed measurement
matrix to identify the 3 dominant singular values and corresponding
eigenvectors. These correspond to the affine shape and motion matrices up to a
3 by 3 affine transform. A subsequent stage applies metric constraints to
recover the Euclidean shape and rotation matrices. Ambiguity still exists in
this representation as the reference frame is only known up to an arbitrary
rotation. To propagate uncertainties through the first stage, we use
well known results from matrix perturbation theory to derive the expressions for 
the perturbations on singular values and eigenvectors imii 111411,^! . We propa- 
gate covariances through each step by assuming that the input perturbations are 
Gaussian random variables, linearize and do covariance propagation to derive the 
covariances at each step. The ambiguities in the reconstruction at the different 
steps make the notion of covariances in these spaces invalid. For example, for 
an affine shape we can only make relative comparisons of the output deviations 
to determine what points/frames are more sensitive to input perturbations. We 
also point out that angle covariances measuring the degree to which relation- 
ships such as parallelism and collinearity are preserved in the affine shape (when 
compared to the Euclidean shape) are useful to characterize the degree to which
the intermediate affine shape/motion values are close to the ideal result. 

Our paper is organized as follows: Section 2 discusses the theoretical results
governing the relationships between input image errors and output errors for
affine/Euclidean shape and motion parameter estimates. Section 3 provides a
discussion concerning the interpretation of the derived covariance matrices. For
instance, we provide insights on how errors can be measured given that the
estimated affine shape is known only up to an unknown affine transform. Section
4 describes experiments we conducted to address three issues: correctness of
the theoretical results, an illustration of how errors can be quantified in
affine shape/motion, and an illustration of how relative comparisons of
reconstruction errors can be made to identify the points/frames which are
affected most by input perturbations.



2 Error Characterization 

Vectors and matrices will be in bold face. An entity, e.g. the shape, may be
associated with four variables: the error free shape matrix $\mathbf{S}$, the
observed noisy shape matrix $\hat{\mathbf{S}}$, the shape perturbation matrix
$\Delta_S$, and the first order perturbation of the shape matrix $\delta_S$.
According to the definition, we have $\hat{\mathbf{S}} = \mathbf{S} + \Delta_S$
and $\hat{\mathbf{S}} \simeq \mathbf{S} + \delta_S$, where $\simeq$ denotes equal
in the linear terms. Usually the real shape perturbation matrix $\Delta_S$ may
not be available and we seek $\delta_S$ as the first order (linear)
approximation of $\Delta_S$. A matrix $\mathbf{S} = [s_{ij}]_{m \times n}$ with
$m$ rows and $n$ columns can also be represented in vector form by column first
order (unless stated otherwise) as a vector with $m \times n$ entries,
$[s_{11}, s_{21}, \ldots, s_{m1}, s_{12}, \ldots, s_{m2}, \ldots, s_{mn}]^T$.



2.1 Problem Formulation 

Recall that the factorization approach recovers 3-D shape (x, y, z coordinates) 
and motion (affine camera parameters) in affine and Euclidean spaces given the 
correspondence of P points across F image frames. Our task is to determine 
the first order approximation of the perturbation and covariance matrices of 
shape and motion given the observations of 2-D projections of P points across F 
frames and covariance of the perturbation on image coordinates. The following 
error analysis of the factorization approach is mainly built upon matrix pertur- 
bation theory mini, especially the theorem derived and proved in Appendix A 
jllJJ, where Weng, Huang and Ahuja used for error analysis of their motion and 
shape algorithm from two perspective views. The basic idea is to propagate the
small perturbation on image coordinates, i.e. the perturbation on the registered
measurement matrix, to the three dominant singular values and corresponding
eigenvectors, and from there to the affine/Euclidean motion and shape matrices,
using the covariance matrices as a vehicle for error measures and applications.
2.2 Affine Shape

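Let $m = 2F$ and $n = P$. We first investigate how the perturbation matrix
$\Delta_W$ affects the three largest singular values $\lambda_1, \lambda_2,
\lambda_3$ and the corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2,
\mathbf{v}_3$ of an $m \times n$ measurement matrix $\mathbf{W}$, and thus the
affine shape $\mathbf{S}_a = \Lambda \mathbf{V}^T$, where the singular value
decomposition of $\mathbf{W}$ is

$$[\mathbf{W}]_{m \times n} = \mathbf{U} \Lambda \mathbf{V}^T = [\mathbf{M}_a]_{m \times 3} [\mathbf{S}_a]_{3 \times n}. \quad (1)$$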

Before we proceed further, matrices $\mathbf{W}$ and $\Delta_W$ are scaled down
by the maximum absolute value of $\mathbf{W}$, $c = \max_{i,j} |w(i,j)|$,
$i = 1, \ldots, m$, $j = 1, \ldots, n$. This guarantees the maximum absolute
value of $\Delta_W$ is sufficiently small. The scale factor $c$ is later put
back into the computed singular values and the perturbations on the singular
values.
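As a minimal sketch of the decomposition in (1) together with the scaling step, the following NumPy code factors a noise-free, already registered measurement matrix. The function name `affine_factorization` and the random test data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def affine_factorization(W):
    """Rank-3 factorization W ~= M_a @ S_a with M_a = U_3 and S_a = Lambda_3 V_3^T."""
    c = np.max(np.abs(W))                 # scale factor c = max |w(i, j)|
    U, s, Vt = np.linalg.svd(W / c, full_matrices=False)
    lam = c * s[:3]                       # put the scale back on the singular values
    M_a = U[:, :3]                        # affine motion, m x 3
    S_a = lam[:, None] * Vt[:3, :]        # affine shape, 3 x n
    return M_a, S_a, lam

rng = np.random.default_rng(0)
M_true = rng.normal(size=(10, 3))         # hypothetical motion, m = 2F = 10
S_true = rng.normal(size=(3, 8))          # hypothetical shape, n = P = 8
W = M_true @ S_true                       # ideal measurement matrix of rank 3
M_a, S_a, lam = affine_factorization(W)
```

The recovered factors reproduce `W`, but `M_a` and `S_a` generally differ from `M_true` and `S_true` by an unknown invertible 3 by 3 transform, which is the affine ambiguity discussed in the text.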

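Let $\mathbf{A}$ be an $n \times n$ symmetric matrix defined as:

$$\mathbf{A} = \mathbf{W}^T\mathbf{W} = (\mathbf{U}\Lambda\mathbf{V}^T)^T(\mathbf{U}\Lambda\mathbf{V}^T) = \mathbf{V}\Lambda^2\mathbf{V}^T, \quad (2)$$

we have $\mathbf{V}^T\mathbf{A}\mathbf{V} = \Lambda^2 = \mathrm{diag}\{\lambda_1^2, \lambda_2^2, \ldots, \lambda_n^2\}$.
Matrix $\mathbf{W}$ is usually a rectangular matrix; pre-multiplying it with
its own transpose yields a symmetric matrix $\mathbf{A}$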
so that the Theorem proved in mg holds. 

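According to the Theorem, the first order perturbations on the three most
significant eigenvalues and corresponding eigenvectors of matrix $\mathbf{A}$ are:

$$\delta_{\lambda_i^2} = \mathbf{v}_i^T \Delta_A \mathbf{v}_i,$$
$$\delta_{v_i} = \mathbf{V}\Delta_i\mathbf{V}^T \Delta_A \mathbf{v}_i, \quad (3)$$

where $\mathbf{v}_i$ are the column vectors of $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_n]$,
and the diagonal matrix
$\Delta_i = \mathrm{diag}\{(\lambda_i^2 - \lambda_1^2)^{-1}, \ldots, 0, \ldots, (\lambda_i^2 - \lambda_n^2)^{-1}\}$
with the $i$-th diagonal element as 0, and $i = 1, 2, 3$.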
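The first order formulas in (3) can be checked numerically. The sketch below is an illustration under stated assumptions: the helper name `eig_first_order`, the test matrix with prescribed singular values 3, 2, 1.5, 1, 0.5, and the noise level are all hypothetical choices, and eigenvalues are indexed from 0 in descending order.

```python
import numpy as np

def eig_first_order(A, dA, i):
    """First-order perturbation of the i-th largest eigenpair of a symmetric A."""
    evals, V = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1]              # sort eigenvalues descending
    evals, V = evals[order], V[:, order]
    v_i = V[:, i]
    d_eval = v_i @ dA @ v_i                      # v_i^T dA v_i
    inv_gaps = np.zeros_like(evals)
    mask = np.arange(len(evals)) != i
    inv_gaps[mask] = 1.0 / (evals[i] - evals[mask])  # i-th diagonal entry stays 0
    d_vec = V @ (inv_gaps * (V.T @ dA @ v_i))    # V Delta_i V^T dA v_i
    return evals[i], v_i, d_eval, d_vec

rng = np.random.default_rng(0)
Qm, _ = np.linalg.qr(rng.normal(size=(8, 5)))    # orthonormal columns, 8 x 5
Qn, _ = np.linalg.qr(rng.normal(size=(5, 5)))    # orthogonal, 5 x 5
W = Qm @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5]) @ Qn.T
dW = 1e-6 * rng.normal(size=W.shape)             # small perturbation on W
A = W.T @ W
dA = (W + dW).T @ (W + dW) - A                   # exact perturbation on A

lam2, v, d_lam2, d_v = eig_first_order(A, dA, 0)
new_vals, new_V = np.linalg.eigh(A + dA)         # eigh returns ascending order
v_new = new_V[:, -1]
v_new = v_new if v_new @ v > 0 else -v_new       # resolve the sign ambiguity
err_val = abs((new_vals[-1] - lam2) - d_lam2)
err_vec = np.linalg.norm((v_new - v) - d_v)
```

Both residuals are second order in the perturbation, so they shrink roughly quadratically when `dW` is scaled down. The formulas break down when the gap between the chosen eigenvalue and its neighbours becomes small.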

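The unknowns of $\Delta_A \mathbf{v}_i$ in (3) can be written as

$$\Delta_A \mathbf{v}_i = [v_{i1}\mathbf{I}_n, v_{i2}\mathbf{I}_n, \ldots, v_{in}\mathbf{I}_n]\,\delta_A = \mathbf{H}_i \delta_A, \quad (4)$$

by rewriting the $n \times n$ matrix $\Delta_A$ as the column-first vector form
$\delta_A$ with $n^2$ elements, where $v_{ij}$ is the $j$-th element of vector
$\mathbf{v}_i$ and $\mathbf{I}_n$ is the $n \times n$ identity matrix.
$\mathbf{H}_i$ is an $n \times n^2$ matrix with 1 by $n$ submatrices
$v_{ik}\mathbf{I}_n$, $k = 1, 2, \ldots, n$, each with $n \times n$ elements.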

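$\delta_A$ can be further associated with $\delta_W$ so that we have a closed
form solution for $\delta_{\lambda_i^2}$ and $\delta_{v_i}$ as a function of
$\delta_W$. From the matrix representation of the perturbation on matrix
$\mathbf{A}$

$$\Delta_A = (\mathbf{W} + \Delta_W)^T(\mathbf{W} + \Delta_W) - \mathbf{W}^T\mathbf{W}, \quad (5)$$

the first order approximation

$$[\Delta_A]_{n \times n} \simeq [\mathbf{W}^T]_{n \times m}[\Delta_W]_{m \times n} + [\Delta_W^T]_{n \times m}[\mathbf{W}]_{m \times n} \quad (6)$$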
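is rewritten as its corresponding vector form

$$[\delta_A]_{n^2 \times 1} = [\mathbf{F}_s + \mathbf{G}_s]\,[\delta_W]_{mn \times 1}, \quad (7)$$

where matrix $\mathbf{F}_s = [\mathbf{F}_{ij}]$ has $n$ by $n$ submatrices $\mathbf{F}_{ij}$,

$$\mathbf{F}_{ij} = \begin{cases} \mathbf{W}^T & \text{if } i = j \\ \mathbf{0} & \text{if } i \neq j \end{cases}$$

and matrix $\mathbf{G}_s = [\mathbf{G}_{ij}]$ also has $n$ by $n$ submatrices
$\mathbf{G}_{ij}$ with the $j$-th row being the $i$-th column vector of
$\mathbf{W}$ and all other rows being zeros. More specifically, writing
$\mathbf{w}_i = [w_{1i}, \ldots, w_{mi}]^T$ for the $i$-th column of $\mathbf{W}$,

$$\mathbf{F}_s = \begin{bmatrix} \mathbf{W}^T & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{W}^T & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{W}^T \end{bmatrix}, \quad (8)$$

$$\mathbf{G}_s = \begin{bmatrix}
\begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{0}^T \\ \vdots \\ \mathbf{0}^T \end{bmatrix} & \cdots & \begin{bmatrix} \mathbf{0}^T \\ \vdots \\ \mathbf{0}^T \\ \mathbf{w}_1^T \end{bmatrix} \\
\vdots & & \vdots \\
\begin{bmatrix} \mathbf{w}_n^T \\ \mathbf{0}^T \\ \vdots \\ \mathbf{0}^T \end{bmatrix} & \cdots & \begin{bmatrix} \mathbf{0}^T \\ \vdots \\ \mathbf{0}^T \\ \mathbf{w}_n^T \end{bmatrix}
\end{bmatrix}. \quad (9)$$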

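Then the perturbations on the eigenvalues and eigenvectors of $\mathbf{A}$
subject to small perturbation on $\mathbf{W}$ are:

$$\delta_{\lambda_i^2} \simeq \mathbf{v}_i^T \mathbf{H}_i (\mathbf{F}_s + \mathbf{G}_s)\,\delta_W = \mathbf{D}_{\lambda_i}^T \delta_W,$$
$$\delta_{v_i} \simeq \mathbf{V}\Delta_i\mathbf{V}^T \mathbf{H}_i (\mathbf{F}_s + \mathbf{G}_s)\,\delta_W = \mathbf{D}_{v_i} \delta_W, \quad (10)$$

where $\mathbf{D}_{\lambda_i}$ is a vector with $mn$ elements and
$\mathbf{D}_{v_i}$ is a matrix with dimension of $n \times mn$.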
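To make the vectorized form concrete, the sketch below builds the two block matrices described above (block-diagonal copies of the transposed measurement matrix, and blocks whose j-th row holds the i-th column of the measurement matrix) and checks them against the linearized perturbation of A. The function name `build_Fs_Gs` and the small random test matrices are assumptions for illustration; indices are 0-based in the code.

```python
import numpy as np

def build_Fs_Gs(W):
    """Return F_s and G_s (both n^2 x mn) for column-first vectorization."""
    m, n = W.shape
    F = np.kron(np.eye(n), W.T)                    # W^T on the block diagonal
    G = np.zeros((n * n, m * n))
    for i in range(n):                             # block row i
        for j in range(n):                         # block column j
            # block (i, j) is n x m; its j-th row is the i-th column of W
            G[i * n + j, j * m:(j + 1) * m] = W[:, i]
    return F, G

def vec(X):
    """Column-first vectorization."""
    return X.flatten(order="F")

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                        # hypothetical m = 4, n = 3
dW = rng.normal(size=(4, 3))
F, G = build_Fs_Gs(W)
linear_part = W.T @ dW + dW.T @ W                  # first-order part of Delta_A
residual = np.abs((F + G) @ vec(dW) - vec(linear_part)).max()
```

The identity is exact for the linear part, so `residual` is at the level of floating point round-off regardless of how large `dW` is; only the neglected quadratic term depends on the perturbation size.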

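Without any constraints on the statistical structure of the small perturbation,
we get the covariance matrix of $\mathbf{v}_i$, $\mathbf{v}_j$, $i, j = 1, 2, 3$,
from (10),

$$\Gamma_{v_i v_j} = E\{\delta_{v_i}\delta_{v_j}^T\} = \mathbf{D}_{v_i}\Gamma_W\mathbf{D}_{v_j}^T, \quad (11)$$

where $E$ denotes expectation and $\Gamma_W$ is the covariance matrix of the
measurement matrix, and the variances of the squares of the singular values
$\lambda_i^2$ for $i = 1, 2, 3$:

$$\sigma^2_{\lambda_i^2} = \mathbf{D}_{\lambda_i}^T\Gamma_W\mathbf{D}_{\lambda_i}. \quad (12)$$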

Under the assumption of identical and independent Gaussian perturbation with
zero mean and variance $\sigma^2$ for all image points and their components,
$\Gamma_W = \sigma^2\mathbf{I}_{mn}$, simpler representations are possible:

$$\Gamma_{v_i} = \sigma^2\mathbf{D}_{v_i}\mathbf{D}_{v_i}^T,$$
$$\Gamma_{v_i v_j} = \sigma^2\mathbf{D}_{v_i}\mathbf{D}_{v_j}^T. \quad (13)$$
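The covariance propagation for an eigenvector under i.i.d. noise can be sketched as follows. This is an illustration under assumptions: instead of assembling the large block matrices explicitly, the code uses an equivalent compact Kronecker form of the linear map from vec(dW) to (W^T dW + dW^T W) v_i, which is a derivation made here and not a formula quoted from the paper. The function name `eigvec_jacobian` and the test data are hypothetical.

```python
import numpy as np

def eigvec_jacobian(W, i):
    """Jacobian D (n x mn) with delta_v_i ~= D @ vec(dW), column-first vec."""
    m, n = W.shape
    evals, V = np.linalg.eigh(W.T @ W)
    order = np.argsort(evals)[::-1]                # descending eigenvalues
    evals, V = evals[order], V[:, order]
    v_i = V[:, i]
    inv_gaps = np.zeros(n)
    mask = np.arange(n) != i
    inv_gaps[mask] = 1.0 / (evals[i] - evals[mask])
    # Linear map vec(dW) -> (W^T dW + dW^T W) v_i
    L = np.kron(v_i[None, :], W.T) + np.kron(np.eye(n), (W @ v_i)[None, :])
    return (V * inv_gaps) @ V.T @ L, v_i

rng = np.random.default_rng(0)
Qm, _ = np.linalg.qr(rng.normal(size=(8, 5)))
Qn, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = Qm @ np.diag([3.0, 2.0, 1.5, 1.0, 0.5]) @ Qn.T   # hypothetical 8 x 5 matrix

D, v = eigvec_jacobian(W, 0)
sigma = 0.1                                        # assumed image noise level
Gamma_v = sigma**2 * D @ D.T                       # covariance of the eigenvector

# Compare the linear prediction with an actual small perturbation
dW = 1e-6 * rng.normal(size=W.shape)
_, V_new = np.linalg.eigh((W + dW).T @ (W + dW))
v_new = V_new[:, -1] if V_new[:, -1] @ v > 0 else -V_new[:, -1]
err = np.linalg.norm((v_new - v) - D @ dW.flatten(order="F"))
```

Because the first order eigenvector perturbation is orthogonal to the eigenvector itself, `Gamma_v` is singular along `v`; this is a property of the linearization rather than of the data.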



It is well known that the real errors on 3-D shape and motion are neither
independent nor Gaussian distributed. The covariance matrices above are just
Gaussian approximations of the real error structure for small perturbations.
This simplification leads to simplified representations and error measures.

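At last, we have the first order perturbation on affine shape
$\mathbf{S}_a = \Lambda\mathbf{V}^T$ subject to small perturbation on image
coordinates, by combining the perturbations on singular values and eigenvectors
of $\mathbf{W}$:

$$\Delta_{S_a} \simeq \delta_{S_a} =
\begin{bmatrix} \delta_{\lambda_1}\mathbf{v}_1^T \\ \delta_{\lambda_2}\mathbf{v}_2^T \\ \delta_{\lambda_3}\mathbf{v}_3^T \end{bmatrix}
+ \begin{bmatrix} \lambda_1\delta_{v_1}^T \\ \lambda_2\delta_{v_2}^T \\ \lambda_3\delta_{v_3}^T \end{bmatrix}, \quad (14)$$

where $\delta_{\lambda_i} = \sqrt{\lambda_i^2 + \delta_{\lambda_i^2}} - \lambda_i \simeq \frac{\delta_{\lambda_i^2}}{2\lambda_i}$
(for $|\delta_{\lambda_i^2}/\lambda_i^2| \ll 1$) are the perturbations on the
singular values of $\mathbf{W}$, and their variances and covariances are
$\sigma^2_{\lambda_i} \simeq \frac{\sigma^2_{\lambda_i^2}}{4\lambda_i^2}$ and
$\sigma_{\lambda_i\lambda_j} \simeq \frac{\mathbf{D}_{\lambda_i}^T\Gamma_W\mathbf{D}_{\lambda_j}}{4\lambda_i\lambda_j}$.
The covariance matrix for affine shape,

$$\Gamma_{S_a} = \begin{bmatrix}
\sigma^2_{\lambda_1}\mathbf{v}_1\mathbf{v}_1^T + \lambda_1^2\Gamma_{v_1} & \sigma_{\lambda_1\lambda_2}\mathbf{v}_1\mathbf{v}_2^T + \lambda_1\lambda_2\Gamma_{v_1 v_2} & \sigma_{\lambda_1\lambda_3}\mathbf{v}_1\mathbf{v}_3^T + \lambda_1\lambda_3\Gamma_{v_1 v_3} \\
\sigma_{\lambda_1\lambda_2}\mathbf{v}_2\mathbf{v}_1^T + \lambda_1\lambda_2\Gamma_{v_2 v_1} & \sigma^2_{\lambda_2}\mathbf{v}_2\mathbf{v}_2^T + \lambda_2^2\Gamma_{v_2} & \sigma_{\lambda_2\lambda_3}\mathbf{v}_2\mathbf{v}_3^T + \lambda_2\lambda_3\Gamma_{v_2 v_3} \\
\sigma_{\lambda_1\lambda_3}\mathbf{v}_3\mathbf{v}_1^T + \lambda_1\lambda_3\Gamma_{v_3 v_1} & \sigma_{\lambda_2\lambda_3}\mathbf{v}_3\mathbf{v}_2^T + \lambda_2\lambda_3\Gamma_{v_3 v_2} & \sigma^2_{\lambda_3}\mathbf{v}_3\mathbf{v}_3^T + \lambda_3^2\Gamma_{v_3}
\end{bmatrix}, \quad (15)$$

can then be derived from (14) after fitting $\delta_{S_a}$ into its vector form
by row first order,
$\delta_{S_a} = [\delta_{\lambda_1}\mathbf{v}_1^T + \lambda_1\delta_{v_1}^T,\ \delta_{\lambda_2}\mathbf{v}_2^T + \lambda_2\delta_{v_2}^T,\ \delta_{\lambda_3}\mathbf{v}_3^T + \lambda_3\delta_{v_3}^T]^T$,
with the assumption that $E\{\delta_{v_i}\} = \mathbf{D}_{v_i}E\{\delta_W\} = 0$.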

The representation of $\Gamma_{S_a}$ in (15) provides us with various error
measures for affine shapes, from a high level error sensitivity summary for
affine shape, $\|\delta_{S_a}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{S_a}\}}$,
to a fine-grained error sensitivity measure for a specific point, say the
$k$-th point, $\|\delta_{S^k}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{S^k}\}}$,
where covariance matrix $\Gamma_{S^k}$ is a $3 \times 3$ matrix extracted from
rows and columns $k$, $(n + k)$, $(2n + k)$ of covariance matrix $\Gamma_{S_a}$.
$\|\delta_{S^k}\|$ gives us relative error measures for the shape points, e.g.
indicating which points are more sensitive to image errors than the others.
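The per-point measure described above amounts to extracting a 3 x 3 block from the shape covariance and taking the square root of its trace. The sketch below illustrates this; the function name `point_error_measures` is hypothetical, the covariance is a random positive semi-definite stand-in rather than one computed from real data, and point indices are 0-based in the code.

```python
import numpy as np

def point_error_measures(Gamma_S, n):
    """Per-point sqrt(trace) from a 3n x 3n shape covariance in row-first order."""
    measures = np.empty(n)
    for k in range(n):
        idx = [k, n + k, 2 * n + k]               # three coordinates of point k
        Gamma_k = Gamma_S[np.ix_(idx, idx)]       # 3 x 3 covariance of point k
        measures[k] = np.sqrt(np.trace(Gamma_k))
    return measures

rng = np.random.default_rng(0)
n = 5
J = rng.normal(size=(3 * n, 20))                  # stand-in Jacobian (assumption)
Gamma_S = 0.25 * J @ J.T                          # stand-in shape covariance
per_point = point_error_measures(Gamma_S, n)
overall = np.sqrt(np.trace(Gamma_S))              # high level summary measure
worst = int(np.argmax(per_point))                 # most sensitive point
```

The squared per-point measures add up to the squared overall measure, so the ranking in `per_point` shows how the total first order error is distributed across points.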

It is worth noting that we are seeking the 3 most significant simple
eigenvalues, which are usually distinct and not close to 0, in the first order
perturbation analysis. So the inverse terms, e.g. $(\lambda_i^2 - \lambda_j^2)^{-1}$
in matrix $\Delta_i$ in (3), are usually quite stable. Recall that the smallest
simple eigenvalue (very close
to 0) and corresponding eigenvector are sought in pi D) where numeric robustness 
is a big concern. As the ground truth of W is usually unavailable in real appli- 
cations, we use the observed W, Ai and V instead to estimate the first order 
perturbations. 
2.3 Affine Motion

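Following the error analysis for affine shape, let

$$\mathbf{B} = \mathbf{W}\mathbf{W}^T = (\mathbf{U}\Lambda\mathbf{V}^T)(\mathbf{U}\Lambda\mathbf{V}^T)^T = \mathbf{U}\Lambda^2\mathbf{U}^T \quad (16)$$

be an $m \times m$ symmetric matrix. It can be shown that the perturbations on
the eigenvalues and eigenvectors of $\mathbf{B}$ subject to small perturbation
on $\mathbf{W}$ are:

$$\delta_{\lambda_i^2} \simeq \mathbf{u}_i^T\mathbf{J}_i(\mathbf{F}_m + \mathbf{G}_m)\,\delta_W = \mathbf{D}_{\lambda_i}^T\delta_W,$$
$$\delta_{u_i} \simeq \mathbf{U}\Delta_i\mathbf{U}^T\mathbf{J}_i(\mathbf{F}_m + \mathbf{G}_m)\,\delta_W = \mathbf{D}_{u_i}\delta_W, \quad (17)$$

where $\mathbf{D}_{\lambda_i}$ is a vector with $mn$ elements,
$\mathbf{D}_{u_i}$ is a matrix with dimension of $m \times mn$, $\mathbf{u}_i$
are the column vectors of $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_m]$,
$\mathbf{J}_i = [u_{i1}\mathbf{I}_m, u_{i2}\mathbf{I}_m, \ldots, u_{im}\mathbf{I}_m]$
is an $m \times m^2$ matrix with 1 by $m$ submatrices $u_{ik}\mathbf{I}_m$,
$k = 1, 2, \ldots, m$, each with $m \times m$ elements,
$\Delta_i = \mathrm{diag}\{(\lambda_i^2 - \lambda_1^2)^{-1}, \ldots, 0, \ldots, (\lambda_i^2 - \lambda_m^2)^{-1}\}$
has the $i$-th diagonal element as 0, $i = 1, 2, 3$, matrix
$\mathbf{F}_m = [\mathbf{F}_{ij}]$ has $m$ by $n$ submatrices $\mathbf{F}_{ij}$
with the $i$-th column being the $j$-th column vector of $\mathbf{W}$, and
matrix $\mathbf{G}_m = [\mathbf{G}_{ij}]$ also has $m$ by $n$ submatrices
$\mathbf{G}_{ij}$ with $\mathbf{G}_{ij} = w_{ij}\mathbf{I}_m$, i.e., writing
$\mathbf{w}_j = [w_{1j}, \ldots, w_{mj}]^T$ for the $j$-th column of $\mathbf{W}$,

$$\mathbf{F}_m = \begin{bmatrix}
[\mathbf{w}_1\ \mathbf{0}\ \cdots\ \mathbf{0}] & \cdots & [\mathbf{w}_n\ \mathbf{0}\ \cdots\ \mathbf{0}] \\
\vdots & & \vdots \\
[\mathbf{0}\ \cdots\ \mathbf{0}\ \mathbf{w}_1] & \cdots & [\mathbf{0}\ \cdots\ \mathbf{0}\ \mathbf{w}_n]
\end{bmatrix}, \quad (18)$$

$$\mathbf{G}_m = \begin{bmatrix}
w_{11}\mathbf{I}_m & w_{12}\mathbf{I}_m & \cdots & w_{1n}\mathbf{I}_m \\
w_{21}\mathbf{I}_m & w_{22}\mathbf{I}_m & \cdots & w_{2n}\mathbf{I}_m \\
\vdots & \vdots & & \vdots \\
w_{m1}\mathbf{I}_m & w_{m2}\mathbf{I}_m & \cdots & w_{mn}\mathbf{I}_m
\end{bmatrix}. \quad (19)$$

The covariance matrices of $\mathbf{u}_i$ are

$$\Gamma_{u_i} = E\{\delta_{u_i}\delta_{u_i}^T\} = \mathbf{D}_{u_i}\Gamma_W\mathbf{D}_{u_i}^T,$$
$$\Gamma_{u_i u_j} = E\{\delta_{u_i}\delta_{u_j}^T\} = \mathbf{D}_{u_i}\Gamma_W\mathbf{D}_{u_j}^T. \quad (20)$$

Again under the assumption of identical and independent Gaussian perturbation
with zero mean and variance $\sigma^2$ for all image points and their
components, (20) can be simplified as:

$$\Gamma_W = \sigma^2\mathbf{I}_{mn},$$
$$\Gamma_{u_i} = \sigma^2\mathbf{D}_{u_i}\mathbf{D}_{u_i}^T,$$
$$\Gamma_{u_i u_j} = \sigma^2\mathbf{D}_{u_i}\mathbf{D}_{u_j}^T. \quad (21)$$

The covariance matrices of (20) and (21) are just Gaussian approximations to the
error structure on the motion matrix.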






As affine motion matrix $\mathbf{M}_a = [\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3]$
is composed of three eigenvectors corresponding to the three most significant
eigenvalues of $\mathbf{B}$, the perturbation on the affine motion matrix
subject to small perturbations on image coordinates is

$$\delta_{M_a} = [\delta_{u_1}, \delta_{u_2}, \delta_{u_3}], \quad (22)$$

where $\delta_{u_i}$ are given by (17). Thus the covariance matrix for affine
motion $\Gamma_{M_a}$ can be derived from (22) after fitting $\delta_{M_a}$
into its vector form by column first order,

$$\Gamma_{M_a} = E\{\delta_{M_a}\delta_{M_a}^T\} = \begin{bmatrix}
\Gamma_{u_1} & \Gamma_{u_1 u_2} & \Gamma_{u_1 u_3} \\
\Gamma_{u_2 u_1} & \Gamma_{u_2} & \Gamma_{u_2 u_3} \\
\Gamma_{u_3 u_1} & \Gamma_{u_3 u_2} & \Gamma_{u_3}
\end{bmatrix}. \quad (23)$$

Similar to $\Gamma_{S_a}$ in (15), the representation of $\Gamma_{M_a}$ in (23)
also provides various error measures for affine motion, from a high level error
sensitivity summary for affine motion,
$\|\delta_{M_a}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{M_a}\}}$, to the error
measure for a specific frame, even a specific axis, such as
$\|\delta_{M^k}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{M^k}\}}$ for the $k$-th
frame, where covariance matrix $\Gamma_{M^k}$ is a $6 \times 6$ matrix
extracted from rows and columns $2k$, $(2k + 1)$, $(m + 2k)$, $(m + 2k + 1)$,
$(2m + 2k)$ and $(2m + 2k + 1)$ of $\Gamma_{M_a}$. $\|\delta_{M^k}\|$ tells us
which camera frames are more sensitive to image errors than the other frames.
The error measures for the x-axis, y-axis and z-axis are just
$\|\delta_{u_1}\|$, $\|\delta_{u_2}\|$ and $\|\delta_{u_3}\|$, respectively.



2.4 Euclidean Shape and Motion 

Euclidean motion and shape can be recovered by camera auto-calibration Emu 

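An invertible $3 \times 3$ matrix $\mathbf{Q}$ is sought, such that
$\begin{bmatrix} \mathbf{m}_i^T \\ \mathbf{n}_i^T \end{bmatrix}\mathbf{Q}$
corresponds to the real camera matrix for frame $i$, where $\mathbf{m}_i$ and
$\mathbf{n}_i$ are the $(2i)$-th and $(2i+1)$-th row vectors of the affine
motion matrix $\mathbf{M}_a$, and $i = 1, \ldots, F$. The general metric
constraints for affine cameras are

$$\begin{bmatrix}
\mathbf{m}_i^T\mathbf{Q}\mathbf{Q}^T\mathbf{m}_i & \mathbf{m}_i^T\mathbf{Q}\mathbf{Q}^T\mathbf{n}_i \\
\mathbf{m}_i^T\mathbf{Q}\mathbf{Q}^T\mathbf{n}_i & \mathbf{n}_i^T\mathbf{Q}\mathbf{Q}^T\mathbf{n}_i
\end{bmatrix} = \mathbf{A}_i\mathbf{A}_i^T, \quad (24)$$

where $\mathbf{A}_i$ is the intrinsic matrix for frame $i$. Solving for matrix
$\mathbf{Q}$, which can be done by a linear method, yields the solutions of the
motion and shape matrices in Euclidean space, i.e.

$$\mathbf{M}_e = \mathbf{M}_a\mathbf{Q}, \quad \mathbf{S}_e = \mathbf{Q}^{-1}\mathbf{S}_a. \quad (25)$$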

The recovered shape and motion in Euclidean space are still subject to a global 
scale factor and rotation to be registered with the reference coordinate system 
where the ground truth resides. 

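When $\mathbf{Q}$ and $\mathbf{Q}^{-1}$ are given (which is the case when
evaluation based on ground truth is done), the first order approximation on the
Euclidean shape and motion errors can be simplified as

$$\delta_{M_e} \simeq \delta_{M_a}\mathbf{Q}, \quad \delta_{S_e} \simeq \mathbf{Q}^{-1}\delta_{S_a}. \quad (26)$$

By rewriting it in vector form

$$\delta_{M_e} \simeq \mathbf{D}_Q\,\delta_{M_a}, \quad \delta_{S_e} \simeq \mathbf{D}_{Q^{-1}}\,\delta_{S_a}, \quad (27)$$

we have the closed form solution for the covariance matrices of Euclidean
motion and shape

$$\Gamma_{M_e} \simeq \mathbf{D}_Q\Gamma_{M_a}\mathbf{D}_Q^T,$$
$$\Gamma_{S_e} \simeq \mathbf{D}_{Q^{-1}}\Gamma_{S_a}\mathbf{D}_{Q^{-1}}^T. \quad (28)$$

Under the assumption of $\Gamma_W = \sigma^2\mathbf{I}_{mn}$, it is further
simplified as:

$$\Gamma_{M_e} \simeq \sigma^2\mathbf{D}_Q\mathbf{D}_{u_i}\mathbf{D}_{u_i}^T\mathbf{D}_Q^T,$$
$$\Gamma_{S_e} \simeq \sigma^2\mathbf{D}_{Q^{-1}}\mathbf{D}_{v_i}\mathbf{D}_{v_i}^T\mathbf{D}_{Q^{-1}}^T. \quad (29)$$

The interpretation of the error measures for Euclidean shape,
$\|\delta_{S_e}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{S_e}\}}$ and
$\|\delta_{S_e^k}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{S^k}\}}$,
$k = 1, \ldots, P$, and Euclidean motion,
$\|\delta_{M_e}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{M_e}\}}$ and
$\|\delta_{M_e^k}\| \approx \sqrt{\mathrm{trace}\{\Gamma_{M^k}\}}$,
$k = 1, \ldots, F$, is similar to the affine counterparts addressed in
Sections 2.2 and 2.3. The only difference is
that they are more constrained and only a rotation ambiguity is left.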
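As a hedged sketch of the affine to Euclidean covariance propagation, the code below builds the vector-form operators with Kronecker products and applies them to stand-in covariances. The specific Kronecker forms are assumptions derived here from the stated vectorization orders (column-first for motion, row-first for shape); the function name, the random covariances, and the orthogonal choice of Q are illustrative only.

```python
import numpy as np

def euclidean_covariances(Gamma_Ma, Gamma_Sa, Q, m, n):
    """Propagate affine covariances through M_e = M_a Q and S_e = Q^{-1} S_a."""
    Q_inv = np.linalg.inv(Q)
    D_Q = np.kron(Q.T, np.eye(m))         # column-first: vec(dM Q) = D_Q vec(dM)
    D_Qinv = np.kron(Q_inv, np.eye(n))    # row-first: vec(Q^{-1} dS) = D_Qinv vec(dS)
    return D_Q @ Gamma_Ma @ D_Q.T, D_Qinv @ Gamma_Sa @ D_Qinv.T

rng = np.random.default_rng(0)
m, n = 6, 4                               # hypothetical sizes: 3 frames, 4 points
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # orthogonal Q, chosen for the check
J_M = rng.normal(size=(3 * m, 5))
J_S = rng.normal(size=(3 * n, 5))
Gamma_Ma = J_M @ J_M.T                    # stand-in affine motion covariance
Gamma_Sa = J_S @ J_S.T                    # stand-in affine shape covariance
Gamma_Me, Gamma_Se = euclidean_covariances(Gamma_Ma, Gamma_Sa, Q, m, n)

# Direct check of the two vectorization identities on random perturbations
dM = rng.normal(size=(m, 3))
dS = rng.normal(size=(3, n))
ok_motion = np.allclose(np.kron(Q.T, np.eye(m)) @ dM.flatten(order="F"),
                        (dM @ Q).flatten(order="F"))
ok_shape = np.allclose(np.kron(np.linalg.inv(Q), np.eye(n)) @ dS.flatten(order="C"),
                       (np.linalg.inv(Q) @ dS).flatten(order="C"))
```

With an orthogonal Q the traces of the covariances are unchanged, which makes the example easy to verify. For a general invertible Q from auto-calibration the overall error measures are rescaled, so affine and Euclidean measures are not directly comparable in magnitude.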



3 Discussion 



In this section we discuss key issues addressing: 1) better applicability of the 
matrix perturbation theory results to factorization analysis over the original 
application discussed in llbim, 2 ) the use of the error predictions for relative 
comparisons of sensitivities of output terms to input perturbations, and 3) the 
interpretation of errors in affine shape, meaning of covariances. 

The error characterization using the first order (Jacobian) approximation is
reasonable only when the input perturbation is small. How small it must be
depends on the scene structure, camera configuration, and image noise. One
possible measure is



e = 




Xo — A3 



1 

A3 ’ 



(30) 



where $\lambda_1 > \lambda_2 > \lambda_3$ are the three dominant singular values
of the measurement matrix. When the input perturbation is above a certain level,
the linear approximation is no longer applicable and the higher order terms
dominate. However, we are only interested in the perturbation on the three most
dominant singular values and the corresponding eigenvectors, which is more
robust than seeking the perturbation on the smallest singular values (usually
very close to zero) such as
the situation attacked in jlDll 1 ) . 

It is also worth noting that the performance measures derived in Section 2 are
subject to a specific eigensystem determined by the scene and camera
configuration. When the cameras and shape are fixed, the eigensystem of the
registered measurement matrix is also fixed. Under a “small” perturbation on the
image measurements, not big enough to perturb the eigensystem, we have
observations in affine/Euclidean space subject to an unknown but fixed
affine/rotation transform. So the local covariances used for relative measures
are valid and valuable for relative comparison purposes. For example, by
comparing $\|\Gamma_{S^i}\|$, $i = 1, \ldots, P$, and $\|\Gamma_{M^j}\|$,
$j = 1, \ldots, F$, questions like which 3-D point/frame suffers more error than
the others, subject to the specific scene structure, camera configuration and
image noise, can be answered quantitatively. This information is already enough
to tell us which point/frame has the worst performance in the group and should
be removed first if necessary. Making the same relative comparison also helps us
to include further tracking points and frames in the shape and motion
reconstruction. An iterative process of comparing the error measures is helpful
for picking the right points and frames for performance improvement. We are safe
as long as the structure of the eigensystem is preserved.

Since the estimated affine shape is only defined up to an affine transform,
one could ask whether the above covariance matrix for the errors
$\delta_{S_a}$ is a meaningful quantity to compute. This matrix would only allow
relative comparisons since all errors have undergone the same affine transform.
Another
interesting question that could be asked is on how one could compare various 
trials wherein the image perturbations are large enough so that the eigensystems 
are perturbed (e.g. the eigenvectors corresponding to the dominant singular val- 
ues permute). To answer this question, we use the fact that the same unknown 
affine transform is applied to every point in the shape matrix. This means that 
relationships such as parallelism and collinearity for tuples of points in the ideal 
input Euclidean shape should be preserved after the affine transform. Thus, it 
makes sense to characterize the error between the difference in the angles of point 
tuples that satisfy the parallelism/collinearity constraints in the original input 
shape and estimate the precision parameter for these angle differences. In fact, 
if two pairs of lines have the same angle between them in Euclidean space, their 
angles in the affine space would remain the same (since they are undergoing the 
same unknown affine transform). This fact will be used to characterize errors 
in the affine motion estimates as well. In this paper, we concentrate mainly on 
relative comparisons that can be made when the perturbations are small enough. 

Another possibility is to remove the ambiguity by registration of the estimates
for meaningful comparisons. After fixing the affine/rotation ambiguity, the
error measures can even be used for comparison of different shapes undergoing
the same motion, or the same shape undergoing different motions. Suppose two
factorization results are obtained from the same camera configuration and
motion, $\mathbf{W}_\alpha = \mathbf{M}_\alpha\mathbf{S}_\alpha$ and
$\mathbf{W}_\beta = \mathbf{M}_\beta\mathbf{S}_\beta$. There is a transform
$\mathbf{T}$ (an affine transform in affine space and a rotation transform in
Euclidean space) between $\mathbf{M}_\alpha$ and $\mathbf{M}_\beta$, which can
be solved from the linear system $\mathbf{M}_\alpha = \mathbf{M}_\beta\mathbf{T}$.
With the transform $\mathbf{T}$ fixed, shapes $\mathbf{S}_\alpha$ and
$\mathbf{S}_\beta$ and the corresponding error measures are comparable.
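The registration step, solving a linear system that relates the two recovered motion matrices, can be sketched with an ordinary least-squares solve. The function name `register_motion`, the variable names, and the synthetic data are assumptions for illustration; in the Euclidean case one would additionally constrain the transform to be a rotation, which this sketch does not do.

```python
import numpy as np

def register_motion(M_alpha, M_beta):
    """Least-squares 3 x 3 transform T such that M_alpha ~= M_beta @ T."""
    T, *_ = np.linalg.lstsq(M_beta, M_alpha, rcond=None)
    return T

rng = np.random.default_rng(0)
M_beta = rng.normal(size=(10, 3))             # hypothetical motion from result 2
T_true = rng.normal(size=(3, 3))              # unknown affine ambiguity
M_alpha = M_beta @ T_true                     # same motion seen in result 1
S_beta = rng.normal(size=(3, 7))              # hypothetical shape from result 2

T = register_motion(M_alpha, M_beta)
S_beta_registered = np.linalg.inv(T) @ S_beta # shape 2 in the frame of result 1
```

After registration, `M_alpha @ S_beta_registered` reproduces `M_beta @ S_beta`, so the second shape and its error measures are expressed in the same coordinate frame as the first.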

The same approach applies to the situation where shape is fixed. When both 
scene structure and camera configuration are changing or the image perturbation 
is too large, the error measures with respect to different eigensystems do not
provide any meaningful insight for shape and motion distortion. 

4 Experiment Results 

Experimental results are shown in this section to support the above error
analysis and to show how the results can be used to gain insight and improve
performance.

The first simulation checks that the derivations of (14) and (22) based on
perturbation theory are correct and valid for small perturbations. A 3-D model
VENUS with 711 nodes is projected on 10 image frames, and the point
correspondences are perturbed by identical and independent Gaussian noise. The
observed 3-D shape and motion errors from the factorization approach are
compared against the first order approximations from our error analysis. A close
match between them would support the derivations. For this purpose, the VENUS
model is first normalized in the unit cube
$\{x \in (-1, 1],\ y \in (-1, 1],\ z \in (-1, 1]\}$. Ten orthographic cameras
are simulated, all targeting the origin. Five of them are distributed uniformly
on the unit circle on the XZ plane and the other five on the unit circle on the
YZ plane. The 3-D model and its 10 projections are shown in Fig. 1. Independent
Gaussian noise is added to the point coordinates in all frames with size
512 x 512, with distributions $N(0, 4.0)$ and $N(0, 1.0)$. The 711 points across
10 frames and the perturbations are fit into the measurement matrix. Affine
shape, affine motion, and the first order perturbations on them are calculated
based on the factorization approach and the above error analysis. The
comparisons of the observed shape and motion errors,
$\|\hat{\mathbf{S}}_a - \mathbf{S}_a\|$ and $\|\hat{\mathbf{M}}_a - \mathbf{M}_a\|$,
against the derived first order perturbations on affine shape and motion,
$\|\delta_{S_a}\|$ and $\|\delta_{M_a}\|$, are shown in Fig. 2 for 25 random
trials, where the matrix norm is defined as the root sum of squares of the
matrix components, e.g. $\|\mathbf{S}\| = \sqrt{\sum_{i,j} s(i,j)^2}$. It is
easy to see the strong correlation between the two curves. Given $\mathbf{Q}$
and $\mathbf{Q}^{-1}$, $\|\delta_{S_e}\|$ and $\|\delta_{M_e}\|$ have similar
matches.

Fig. 2. Norm of computed error (solid line) versus norm of first order perturbation
(dotted line) for the VENUS model. (a) $\|\Delta_{S_a}\| \approx \|\delta_{S_a}\|$
from (14); (b) $\|\Delta_{M_a}\| \approx \|\delta_{M_a}\|$ from (22).

The next experiment illustrates errors in affine space by checking the
parallelism relationship. A total of 13 random points are generated in the unit
cube, where 4 of them specify 2 parallel lines. They are projected on 5 image
planes (512 x 512) by the orthographic cameras in Table 1. Samples from a
Gaussian distribution with zero mean and standard deviation ranging from 0.1 to
1.5 are simulated as image errors. The errors propagate through the
factorization approach and perturb the orientation of the two lines that are
parallel in Euclidean space. The angle in degrees between the two lines is
calculated, and this is repeated for 1000 trials to collect the statistics of
the angle variance. The result is shown in Fig. 3. The relationship of
parallelism serves as an interpretation of the errors in affine space.
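A scaled-down version of this parallelism experiment can be sketched as follows. Everything specific here is an assumption for illustration: the camera angles, the pixel scale of 200, the direction of the parallel lines, and the use of 200 trials instead of 1000. The sketch relies on the fact that an affine transform preserves parallelism, so the noise-free angle in affine space is zero.

```python
import numpy as np

def affine_shape(W):
    """Affine shape S_a = Lambda_3 V_3^T from the registered measurement matrix."""
    W0 = W - W.mean(axis=1, keepdims=True)        # remove per-row centroid
    _, s, Vt = np.linalg.svd(W0, full_matrices=False)
    return s[:3, None] * Vt[:3, :]

def line_angle_deg(S, a, b, c, d):
    """Unsigned angle in degrees between lines (a, b) and (c, d) of a 3 x n shape."""
    u, v = S[:, b] - S[:, a], S[:, d] - S[:, c]
    return np.degrees(np.arctan2(np.linalg.norm(np.cross(u, v)), abs(u @ v)))

rng = np.random.default_rng(1)
S = rng.uniform(0.0, 1.0, size=(3, 13))           # 13 random points
direction = np.array([0.4, 0.2, 0.3])
S[:, 1] = S[:, 0] + direction                     # line 1: points 0 and 1
S[:, 3] = S[:, 2] + 0.8 * direction               # line 2: points 2 and 3, parallel
thetas = np.radians([-40.0, -20.0, 0.0, 20.0, 40.0])
# Five orthographic cameras rotating about the Y axis (two rows per frame)
M = np.vstack([[[np.cos(t), 0.0, -np.sin(t)], [0.0, 1.0, 0.0]] for t in thetas])
W = 200.0 * M @ S                                 # image coordinates in pixels

def angle_std(sigma, trials=200):
    """Standard deviation of the recovered angle under Gaussian image noise."""
    angles = [line_angle_deg(affine_shape(W + rng.normal(scale=sigma, size=W.shape)),
                             0, 1, 2, 3) for _ in range(trials)]
    return np.std(angles)

angle_clean = line_angle_deg(affine_shape(W), 0, 1, 2, 3)
std_low, std_high = angle_std(0.1), angle_std(1.5)
```

Because the measured angle is unsigned, its distribution is one-sided; a signed angle about a fixed axis would be needed to estimate a bias separately from the spread.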

Finally, we demonstrate how the shape and motion error measures derived
above can be used for relative comparisons. 11 points in the unit cube and 5
orthographic cameras are simulated; details are listed in Tables 1 and 2. The
11 points are projected onto the 5 image planes (512 x 512). Small perturbations
are added to the image coordinates of the corresponding points. Assuming the
noise has an independent, identical zero mean Gaussian distribution with
standard deviation $\sigma = 0.1, 0.5, 1$, the shape and motion error measures in
affine space are shown in Fig. 4 and those in Euclidean space are shown in Fig. 5.
By comparing the affine and Euclidean shape error measures for each point and the
affine and Euclidean motion error measures for each frame, we can draw
conclusions such as: point 11 is less sensitive to image noise than point 4, and
frame 3 is more sensitive to image noise than frame 1.
This kind of information is already enough to identify which point/frame has
the poorest error sensitivity, and can be used for integrating further
points and frames for performance improvement. Furthermore, not only the
3-D shape and motion but also the relative error sensitivity information can be
recovered, and visualized if necessary, given the error perturbation model on the
correspondences.

Fig. 3. Estimated standard deviation of the angles between two lines in affine space
subject to the errors on image coordinates.



Table 1. Affine and Euclidean motion errors. Five orthographic cameras are simulated
on the Z-X plane, targeting the origin. The 11 points are captured on the five
512 x 512 image frames. Independent, identically distributed zero mean Gaussian
noise is added to the image coordinates of the corresponding points. Given the
standard deviation σ = 0.1, 0.5, 1 on the image coordinates, the estimated
corresponding affine motion error and Euclidean motion error are listed in the
last 6 columns.

Frame  Camera matrix                  Affine error               Euclidean error
                                      σ=0.1   σ=0.5   σ=1.0      σ=0.1   σ=0.5   σ=1.0
1      104.1      0    104.1   255    3.00    6.717   9.499      5.996   13.406  18.959
           0 -147.2        0   255
2      136        0     56.3   255    3.26    7.29    10.309     6.556   14.66   20.732
           0 -147.2        0   255
3      147.2      0        0   255    3.424   7.656   10.828     6.928   15.492  21.909
           0 -147.2        0   255
4      136        0    -56.3   255    3.422   7.651   10.82      6.946   15.532  21.965
           0 -147.2        0   255
5      104.1      0   -104.1   255    3.257   7.282   10.298     6.607   14.774  20.894
           0 -147.2        0   255

Table 2. Affine and Euclidean shape errors. Eleven points are distributed
asymmetrically in the unit cube, 10 on the Z-X plane and 8 on the Z axis.
Independent, identically distributed zero mean Gaussian noise is added to the
image coordinates of the corresponding points on 512 x 512 images. Given the
standard deviation σ = 0.1, 0.5, 1 on the image coordinates, the estimated
corresponding affine shape error and Euclidean shape error are listed in the
last 6 columns.

Point  (x, y, z)        Affine error                   Euclidean error
                        σ=0.1     σ=0.5    σ=1.0       σ=0.1     σ=0.5    σ=1.0
1      (0, 0.28, -1)    1.16314   2.60086  3.67818     0.718179  1.6059   2.27108
2      (-0.28, 0, -1)   1.18246   2.64407  3.73928     0.655924  1.46669  2.07421
3      (-0.14, 0, -1)   1.04991   2.34766  3.3201      0.605653  1.35428  1.91524
4      (0, 0, -0.98)    1.13453   2.53689  3.5877      0.632423  1.41414  1.9999
5      (0, 0, -0.84)    1.12433   2.51407  3.55543     0.634903  1.41969  2.00774
6      (0, 0, -0.7)     1.11049   2.48312  3.51166     0.632942  1.4153   2.00154
7      (0, 0, -0.56)    1.09287   2.44373  3.45596     0.626499  1.4009   1.98117
8      (0, 0, -0.42)    1.07129   2.39547  3.38771     0.615433  1.37615  1.94617
9      (0, 0, -0.28)    1.0455    2.33781  3.30616     0.599488  1.3405   1.89575
10     (0, 0, -0.14)    1.01518   2.27002  3.21029     0.578261  1.29303  1.82862
11     (0, 0, 0)        0.979918  2.19116  3.09877     0.551142  1.23239  1.74286


5 Conclusion 

We have derived the first order perturbation and covariance matrices for affine/
Euclidean shape and motion subject to small perturbations on image coordinates
based on matrix perturbation theory, and used them as a vehicle for error
measures in applications. Step-by-step error analysis and propagation based on
matrix perturbation theory and covariance propagation are derived and validated.
Interpretation of the errors in affine space is also addressed. Relative error
measures derived from local covariance matrices are used to identify which output
points/frames are sensitive to image measurement errors.

Fig. 4. Error measures for (a) affine shape $\|\delta_{S_a}\|$ and (b) affine motion
$\|\delta_{M_a}\|$. They tell us, for example, that point 4 is more sensitive to
image error than point 11 and frame 3 is more sensitive to image error than frame 1.

Fig. 5. Error measures for (a) Euclidean shape $\|\delta_{S_e}\|$ and (b) Euclidean
motion $\|\delta_{M_e}\|$. They tell us, for example, that point 4 is more sensitive
to image error than point 11 and frame 3 is more sensitive to image error than
frame 1.

Discussion

Last-minute visa problems prevented Zhaohui Sun from attending, so Terry 
Boult gave the talk. A response from the authors is also included below. 

Kenichi Kanatani: Why do you bother with covariance propagation? If you
do covariance propagation and you do simulations and your analysis agrees with
the experiment, that proves only that your analysis is correct. To me, what is
more important is to analyze the theoretical accuracy bound, based in this case
on the Fisher information. If you compute the Fisher information and do the
simulation and your results agree with the theoretical analysis, then your method
is optimal, otherwise it is not. Once you have confirmed that the method is optimal
you can use the Fisher information as the potential, or characteristic, of noise
behaviour instead of doing covariance propagation. So to me Fisher information
is more essential than covariance propagation.

Terry Boult: This is not my paper, so I’m not going to put on a hat and 
say I believe this or that is the right approach. But I can answer partially, 
because we talked about this during visits to Siemens. This paper does covariance 
propagation using a first-order approximation to the covariance. That’s good, 
but it isn’t as useful as propagating the exact covariances would be. But they 
don’t know how to do that yet. 

Kenichi Kanatani: The next question is very simple. I think you can do all this 
covariance propagation analysis using software, using automatic differentiation 
tools. Why bother with these analytical or Taylor expansions? 

Terry Boult: Well, for a particular set of data you could simulate for a long 
time and say yes, I can do numerical simulation and estimate the covariance by 
just taking the inputs and outputs. The advantage of doing it analytically is that 
the approximate output covariance can be directly predicted for any particular 
set of matrices and cameras. 

Kenichi Kanatani: Not numerical simulation. There are software tools avail- 
able to automatically differentiate given formula expressions. 

Terry Boult: Do those tools automatically give the error on the eigenvectors 
of a matrix? - I’m unaware of such tools. If you can give me a reference I’ll pass 
it back to the authors. 

Rick Szeliski: One of your motivations was to identify points that have a lot 
of uncertainty and throw them out. But when you triangulate a point that is far 
from you, you typically do have a lot of uncertainty. Yet you don’t usually throw 
it away, you simply declare that it was not very reliably measurable in absolute 
terms. Could you clarify that? 

Terry Boult: Personally, I agree with you. I wouldn’t throw points out just 
because they are noisy. But the SFM problem is completely interspersed, shape 
and motion come together, so you may be better off temporarily throwing out 
any especially troublesome points, computing the stabler ones, then recalculating 
the difficult ones by some other technique. But as I said, the importance of a
feature, just because it is distant or possibly unreliable, is different from its
noisiness.

Joss Knight: Presumably you could use this in robust correspondence tech- 
niques. Incorrectly matched points would have a huge covariance error, and you 
could just get rid of them and recalculate. 

Terry Boult: It’s not clear. If you redid this analysis with some points that were 
very incorrectly matched, the eigenstructure would become very different and 
the perturbation expansion, at least in theory, would no longer apply. So until 
someone has implemented and experimented with it, it is not clear how valuable 
this approach would prove. What happens is that as soon as you get some false 
correspondences, your fourth eigenvalue increases in value very quickly. 

The authors: Note that our paper illustrates how small perturbation analysis 
can be applied to SVD based estimation schemes. This is of general interest for 
a whole range of vision problems. 

Prof. Kanatani’s question is interesting. The Cramer-Rao bound does of 
course give the optimal minimum variance unbiased estimator. But in the broader 
context of vision systems, accuracy is not the only requirement. The design en- 
gineer also has computational constraints to meet, and procedures with sub- 
optimal accuracy may be preferred. Covariance propagation makes perfect sense 
in this context. To determine how system design choices affect the total system, 
we need to study the performance as a function of input data, algorithm and tun- 
ing parameters. Our philosophy is that each vision algorithm or module should 
be treated as an estimator (linear or non-linear depending on the sub-task) and 
characterized in terms of the bias and covariance of its estimates. Suppose SFM 
is one module of a vision algorithm chain involving point extraction, tracking, 
SFM (e.g. affine shape estimation), and image-based rendering. To predict the
final output (image-based rendering) error, we use covariance propagation on a 
bias-covariance characterization of the error of each module in the chain. The 
chosen SFM module may satisfy the end-to-end system accuracy and speed re- 
quirements without being optimally accurate. 

Rick Szeliski and Joss Knight ask the related question of how the covariance 
estimates should be used in practice. We have not yet explored the practical 
implications of the theoretical results very thoroughly, but feature selection based 
on error measures is certainly a possibility. Several issues remain: 1) The error 
analysis uses small perturbation assumptions. Large deviations (e.g. incorrectly
matched points) may lead to large changes in the eigensystem, whose effect has
not been analyzed. 2) In principle, Rick's observation that one has to keep some
points with large variance is correct (as long as there are no matching errors).
The point covariances are functions of the geometry (the depths of the 3D point,
the camera viewpoints). This makes it difficult to tell whether a large variance
was due to insufficient (ill-conditioned) data or outliers (incorrectly matched 
points). 3) The uncertainty of the image features is a function of the operators 
used to detect and track them and the underlying geometric and illumination 
models. Heteroscedasticity is to be expected. 




Bootstrapping Errors-in-Variables Models



Bogdan Matei and Peter Meer 



Electrical and Computer Engineering Department 
Rutgers University, Piscataway, NJ, 08854-8058, USA 
matei, meer@caip.rutgers.edu 



Abstract. The bootstrap is a numerical technique, with solid theoreti- 
cal foundations, to obtain statistical measures about the quality of an es- 
timate by using only the available data. Performance assessment through 
bootstrap provides the same or better accuracy than the traditional error 
propagation approach, most often without requiring complex analytical 
derivations. In many computer vision tasks a regression problem in which 
the measurement errors are point dependent has to be solved. Such re- 
gression problems are called heteroscedastic and appear in the lineariza- 
tion of quadratic forms in ellipse fitting and epipolar geometry, in camera 
calibration, or in 3D rigid motion estimation. The performance of these 
complex vision tasks is difficult to evaluate analytically, therefore we pro- 
pose in this paper the use of bootstrap. The technique is illustrated for 
3D rigid motion and fundamental matrix estimation. Experiments with 
real and synthetic data show the validity of bootstrap as an evaluation 
tool and the importance of taking the heteroscedasticity into account. 



1 Introduction 

No estimation process is complete without reliable information about the accu- 
racy of the solution. Standard error, bias, or confidence interval are among the 
most often used statistical measures, however, in practice their computation is 
difficult especially when there is no information available about the distribution 
of the noise process and the ground truth. 

The traditional approach to error analysis in computer vision is error
propagation [pp. 125-164]. For highly nonlinear transfer functions the validity
of error propagation is restricted to a small neighborhood around the estimate. 
Analytical computation of the required Jacobians is often very difficult. 

In this paper we make extensive use of a new paradigm for error analysis 
with solid theoretical foundations, the bootstrap. Being a numerical technique, 
the bootstrap can substitute the analytical derivations of the error propagation 
with simulations derived exclusively from the data. More importantly, since the 
error propagation is only a first order approximation, the bootstrap is more 
accurate and has a larger applicability [pp. 313-315]. Though the bootstrap
principle looks similar to Monte Carlo simulations, the main difference between 
them is that the former uses only the corrupted data, while the latter needs 
ground truth information. 



B. Triggs, A. Zisserman, R. Szeliski (Eds.): Vision Algorithms’99, LNCS 1883, pp. 236-|2^2| 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000

The bootstrapping of a regression model, a distinct topic of the bootstrap
methodology, has been widely investigated, yet most of the work has focused on
ordinary regression where the explanatory variables are assumed free of
measurement errors. The vast majority of regression problems encountered in
computer vision applications do however have errors in the explanatory
variables. What further complicates the estimation and evaluation of
these errors-in-variables (EIV) models is the fact that each measurement might
have a different uncertainty about its true value. The linearization of quadratic
or bilinear forms encountered in ellipse fitting and epipolar geometry
respectively, or 3D rigid motion with the 3D data extracted from stereo, all
yield point dependent measurement errors. EIV regression having point dependent
measurement errors is called heteroscedastic (HEIV) regression. By extension,
the point dependent errors are hereafter called heteroscedastic noise.

Since the bootstrap requirement of i.i.d. data is not respected by the HEIV 
model, the data can not be used directly in the resampling process. In this paper 
we propose a versatile technique in which the noise affecting the measurements is 
recovered by an estimator taking into account the heteroscedasticity. The noise 
process is then transformed to obey the requirements of the bootstrap. 

The problem domains chosen to illustrate the bootstrap based error analysis
paradigm are the 3D rigid motion and fundamental matrix estimation. The paper
is organized as follows. In Section 2 the bootstrap paradigm is introduced. The
analysis of rigid motion using bootstrap is presented in Section 3. Bootstrap
based performance assessment of the fundamental matrix estimation is described
in Section 4.

2 Bootstrap 

The bootstrap is a resampling method which extracts valid statistical measures 
like standard error, bias or confidence intervals in an automatic manner by means 
of computer intensive simulations using only the available data. Since its intro- 
duction in the late 70’s by Bradley Efron, the bootstrap has evolved into a very 
powerful tool supported by numerous theoretical studies. Though the underly- 
ing principle is fairly simple, the use of bootstrap in a practical problem should 
always be preceded by a careful analysis. When the assumptions (to be specified
below) upon which the bootstrap is based are violated, inconsistent and
misleading results may be obtained. A thorough introduction to the bootstrap
methodology is the textbook of Efron and Tibshirani 0, while additional ma- 
terial can be found in A short review is given next. 

2.1 Introduction to Bootstrap 

A first condition for the validity of the bootstrap procedure is that the available
data points $\{z_1, z_2, \ldots, z_n\}$ are i.i.d. Their distribution $F$ is assumed
unknown. In the absence of any prior information the empirical distribution
$\hat F$, obtained by assigning equal probability $1/n$ to each measurement $z_i$,
is used as a representation of the true distribution $F$. The data is employed to
estimate a p-dimensional statistic $\theta = \theta(F)$. Let
$\hat\theta = g(z_1, z_2, \ldots, z_n)$ be such an estimate. If the probability
distribution was known, the computation of the bias or covariance of $\hat\theta$
would be immediate and could be achieved by either a theoretical derivation or
Monte Carlo simulations as

$\mu_F(\hat\theta) = E_F[\hat\theta] - \theta, \qquad \mathrm{cov}_F(\hat\theta) = E_F\left[ (\hat\theta - E_F[\hat\theta]) (\hat\theta - E_F[\hat\theta])^T \right]$    (1)

where $E_F[\cdot]$ is the expectation under the probability distribution $F$. Since
$F$ is unknown, the bootstrap approximates (1) by resampling the data from the
available $\hat F$ distribution. The value $z_1^*$ is drawn with replacement from
$\{z_1, z_2, \ldots, z_n\}$. By repeating this resampling $n$ times a bootstrap
sample or bootstrap set, denoted by $\{z_1^*, z_2^*, \ldots, z_n^*\}$, is generated.
Let $\hat\theta^* = g(z_1^*, z_2^*, \ldots, z_n^*)$. The bootstrap approximation
to (1) is

$\mu_{\hat F}(\hat\theta) = E_{\hat F}[\hat\theta^*] - \hat\theta, \qquad \mathrm{cov}_{\hat F}(\hat\theta) = E_{\hat F}\left[ (\hat\theta^* - E_{\hat F}[\hat\theta^*]) (\hat\theta^* - E_{\hat F}[\hat\theta^*])^T \right]$    (2)

In practice, the sample moments are used to approximate (2) by generating $B$
bootstrap sets. Thus,

$\mu_{\hat F}(\hat\theta) \approx \hat\mu_{\hat\theta} = \bar\theta^* - \hat\theta, \qquad \bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^{*b}$    (3)

$\mathrm{cov}_{\hat F}(\hat\theta) \approx \hat C_{\hat\theta} = \frac{1}{B-1} \sum_{b=1}^{B} (\hat\theta^{*b} - \bar\theta^*) (\hat\theta^{*b} - \bar\theta^*)^T$    (4)

The bias corrected covariance matrix of 9 is 



9-0 9-0 



(5) 



The two approximations, the substitution of $\hat F$ for $F$ and the replacement of
the true moments by the sample ones, are the main sources of inaccuracy of the
bootstrap technique. If the number of measurements is too small (say below 20), 
the variability of the bootstrap estimates becomes too large for the results to be 
reliable. 
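
As an illustration of the resampling estimates of bias and covariance described above, here is a minimal NumPy sketch. The function name `bootstrap_bias_cov`, its arguments, and the $1/(B-1)$ normalization of the sample covariance are assumptions made for this example; the data are assumed i.i.d., as the text requires.

```python
import numpy as np

def bootstrap_bias_cov(z, estimator, B=200, seed=0):
    """Bootstrap bias and covariance of estimator(z) for i.i.d. rows of z."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    n = len(z)
    theta_hat = np.atleast_1d(estimator(z))       # estimate from the original data
    reps = np.empty((B, theta_hat.size))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # draw n indices with replacement
        reps[b] = np.atleast_1d(estimator(z[idx]))
    theta_bar = reps.mean(axis=0)                 # mean of the bootstrap replicates
    bias = theta_bar - theta_hat                  # sample-moment bias estimate
    dev = reps - theta_bar
    cov = dev.T @ dev / (B - 1)                   # sample covariance of the replicates
    return theta_hat, bias, cov
```

For example, with `estimator=lambda s: s.mean(axis=0)` the returned covariance should be close to the sample covariance of the data divided by `n`, up to the Monte Carlo error of using a finite `B`.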

When the assumption of i.i.d. data is not valid the bootstrap may fail. This 
means that the outliers in the data must be removed before the resampling 
is performed, i.e. the preprocessing should be robust. Diagnostic methods, like 
jackknife after bootstrap [pp. 113-123], can be employed to evaluate the
bootstrap estimate; however, due to the masking effects these techniques have only
limited sensitivity in the presence of significant contamination.

Most of the theoretical results which validate the bootstrap paradigm are
obtained by Edgeworth expansions assuming a smooth estimator. For non-smooth
estimators there are methods which can be used instead, but they do not enjoy
the same accuracy as the bootstrap itself [pp. 41-44]. The classical example
where the bootstrap completely fails is the sample median. One should
therefore be extremely circumspect in using the techniques described above for
estimators based on the median or other nonlinearities. The bootstrap procedure
should only be applied to the data obtained at the output of high breakdown
point robust estimators like the least median of squares.

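The traditional approach based on error propagation approximates
$\mathrm{cov}(\hat\theta)$ as

$C_{\hat\theta} = \sum_{i=1}^{n} \left[ \frac{\partial g}{\partial z_i} \right]_{z_i = \hat z_i} C_{z_i} \left[ \frac{\partial g}{\partial z_i} \right]^T_{z_i = \hat z_i}$    (6)

where $C_{z_i}$ is the covariance of the data $z_i$ and
$\left[ \partial g / \partial z_i \right]_{z_i = \hat z_i}$ is the Jacobian of the
transformation from $z_i$ to $\hat\theta = g(z_1, z_2, \ldots, z_n)$. Approximating
$\mathrm{cov}(\hat\theta)$ by bootstrap eliminates the need for analytical
derivations in the Jacobian calculation, at
the expense of an increased amount of computer simulations.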
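
For comparison with the bootstrap, first-order error propagation can also be carried out numerically. The sketch below sums the per-point terms $J_i C_{z_i} J_i^T$ with the Jacobians approximated by forward differences evaluated at the supplied data; the function name, the finite-difference step and the assumption of independent measurements are choices made for this example only.

```python
import numpy as np

def propagate_covariance(g, z, covs, eps=1e-6):
    """First-order covariance of g(z) for independent points z[i] with covariances covs[i]."""
    z = np.asarray(z, dtype=float)
    n, d = z.shape
    theta0 = np.atleast_1d(g(z))
    C = np.zeros((theta0.size, theta0.size))
    for i in range(n):
        J = np.empty((theta0.size, d))            # Jacobian of g with respect to z[i]
        for k in range(d):
            zp = z.copy()
            zp[i, k] += eps                       # perturb one coordinate of one point
            J[:, k] = (np.atleast_1d(g(zp)) - theta0) / eps
        C += J @ covs[i] @ J.T                    # contribution of point i
    return C
```

For the sample mean of `n` points with identity covariances this returns the identity divided by `n`, which is the exact covariance because the estimator is linear.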

The bootstrap can be also used in constructing elliptical confidence regions 
for 9. They have better coverage compared with the rectangular ones since they 
exploit the existing correlation between the components of 9 and can be obtained 
even when the normality assumption fails HE|. 

The exact number of bootstrap samples $B$ required to compute the covariance
matrix $\hat C_{\hat\theta}$ or the confidence regions is difficult to prescribe.
In practice, especially when the computation of $\hat\theta$ is time consuming,
the trade-off is between the accuracy of the bootstrap solution and the time
spent on simulations.

We have found that B = 200 usually suffices for a good covariance estimation. 



2.2 Bootstrap for HEIV Regression 

The bootstrap method introduced in Section 2.1 cannot be applied directly to
HEIV regression. Assume that the true, unknown measurements $z_{io}$,
$i = 1, \ldots, n$ are additively corrupted with heteroscedastic noise
$\delta z_i$ having zero mean and point dependent covariance $C_{z_i}$. The
covariances are known up to a common multiplicative factor, the noise variance
$\sigma^2$. The true values $z_{io}$ obey the linear model
$\alpha + z_{io}^T \theta = 0$.

The covariance, bias and confidence regions of a given estimator
$\{\hat\theta, \hat\alpha\}$ are obtained by bootstrap using the procedure sketched
in Figure 1. An estimator which takes into account the heteroscedasticity of the
data (HEIV block) is employed to obtain the corrected data $\hat z_i$ by projecting
onto the hyperplane of the solution $\{\hat\theta, \hat\alpha\}$. In order to
satisfy the i.i.d. condition the residuals $\delta\hat z_i = z_i - \hat z_i$
are whitened using the corresponding covariances $C_{\delta\hat z_i}$. Their
expression is given in the appendix. The whitened residuals are sampled with
replacement and colored with the corresponding covariances. The bootstrapped data
is finally obtained by adding the resampled noise to the corrected data
$\hat z_i$. Any estimator can now be evaluated through bootstrap using (3) and (4).

Fig. 1. Bootstrapping in a heteroscedastic environment. (a) Recovery of an i.i.d. noise
process using the HEIV estimator. (b) Generation of bootstrap samples by coloring
the i.i.d. residuals. Any estimator can now be evaluated.
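
The whiten, resample and color cycle can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the corrected data, the residuals and positive definite residual covariances are already available, and it uses Cholesky factors for whitening and coloring. Residual covariances obtained from a projection can be rank deficient, in which case a different square root would be needed. The function name `whiten_resample_color` is hypothetical.

```python
import numpy as np

def whiten_resample_color(z_hat, resid, covs, rng):
    """Generate one bootstrap data set from heteroscedastic residuals.

    z_hat : (n, d) corrected data
    resid : (n, d) residuals, resid[i] having covariance covs[i]
    covs  : (n, d, d) positive definite covariances (assumption of this sketch)
    """
    L = np.linalg.cholesky(covs)                           # covs[i] = L[i] @ L[i].T
    white = np.linalg.solve(L, resid[..., None])[..., 0]   # whitened residuals
    idx = rng.integers(0, len(resid), size=len(resid))     # resample with replacement
    colored = (L @ white[idx][..., None])[..., 0]          # color with the covariance of point i
    return z_hat + colored
```

Repeating the call `B` times and feeding each returned data set to the estimator under study gives the bootstrap replicates from which bias and covariance are computed.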



3 Rigid Motion Evaluation 



Rigid motion estimation is a basic problem in middle-level computer vision, 
thoroughly analyzed in numerous papers. Two of the most popular rigid motion 
estimators are based on the SVD decomposition [lElj and on the quaternion 
representation [ni. Both assume i.i.d. data and give exactly the same results 0. 
The 3D rigid motion estimation under heteroscedastic noise was investigated by 
Pennec and Thirion m- They used an extended Kalman filter for finding the 
motion parameters and obtained closed-form expressions for the covariance of 
the rotation and translation based on error propagation. However, the data was
processed sequentially, and thus not all the available information was taken
into account at each processing stage. The computation of the Jacobians
entering the expressions of the covariance matrices was quite complex. Recently, Ohta
and Kanatani nq introduced an algorithm for rotation estimation under het- 
eroscedastic noise based on an optimization technique called renormalization and 
provided a lower bound on the covariance of the rotation. However, the analysis 
was restricted to pure rotation and in the presence of a translation component 
the estimator was no longer optimal. 

The bootstrap methodology introduced in Section 2 is utilized in the sequel
in the analysis of 3D rigid motion. Reliable assessment of the accuracy of the 
solution yielded by an arbitrary rigid motion algorithm under heteroscedastic 
noise is provided using only the available data. 

Let the two sets of matched 3D measurements be $U = \{u_1, u_2, \ldots, u_n\}$ and
$V = \{v_1, v_2, \ldots, v_n\}$. Each measurement is a corrupted version of the
true value, distinguished by the subscript 'o',

$u_i = u_{io} + \delta u_i, \qquad v_i = v_{io} + \delta v_i$.

The heteroscedastic noise has zero mean and data dependent covariance
$E[\delta u_i \delta u_i^T] = C_{u_i}$ and $E[\delta v_i \delta v_i^T] = C_{v_i}$,
respectively. Rigid motion estimation is a multivariate EIV regression problem,
since the true values satisfy the 3D rigid motion constraint

$v_{io} = R u_{io} + t$,    (7)

where R is the 3x3 rotation matrix and t is the translation vector. 

The quaternion representation of the rotation matrix transforms (7) into an
equivalent multivariate linear regression

$\alpha + M_o q = 0$,    (8)

where $M_o$ is a 3 x 4 matrix obtained from the true coordinates, $q$ is the
quaternion of the rotation and $\alpha$ is the intercept depending on $q$ and $t$.
A solution
of (0 taking into account the heteroscedasticity was presented in m and is 
summarized in Appendix 0 

Let $\hat R$ and $\hat t$ be the rotation and translation given by a consistent
rigid motion estimator, i.e. one which reaches the true solution asymptotically.
We have used the HEIV algorithm which satisfies this consistency requirement.

The residuals $\delta\hat v_i = v_i - \hat v_i$ and
$\delta\hat u_i = u_i - \hat u_i$ are obtained by orthogonal projection of the
measurements $v_i$ and $u_i$ onto the three dimensional manifold defined
by $\hat R$ and $\hat t$ in $\mathbb{R}^6$, as shown in Appendix B. The corrected
points $\hat v_i$ and $\hat u_i$ can be considered unbiased estimators of the true,
unknown measurements $v_{io}$ and $u_{io}$ when the estimates $\hat R$ and $\hat t$
are close to the true $R$ and $t$. The residuals are not i.i.d., having their
covariance dependent on the measurement index $i$. Therefore, to correctly apply
the bootstrap procedure a whiten-color cycle is necessary (see Section 2.2).

The bootstrap of residuals method requires a consistent estimator. Once the
correct noise process is recovered, however, any rigid motion estimator (consis- 
tent or not) can be analyzed in the same automatic manner. 

Let $R^{*b}$ and $t^{*b}$ be the rotation and translation yielded by an arbitrary
estimator using the bootstrap sample $b$. The covariance and bias for translation
can be estimated using (3) and (4), but for the covariance of rotation the fact that
the rotations form a multiplicative group must be taken into account [21]. Let
the three-dimensional vector $r$ be the angle-axis representation of the rotation
matrix $R$, defined as $r = f(R)$. The rotation error between the estimate
$\hat R$ and $R$ is then $\delta\hat r = f(\hat R R^T)$, and the bootstrap for the
covariance of the rotation $\hat C_{\hat r}$ uses in (4) the
$\delta r^{*b} = f(R^{*b} \hat R^T)$.
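
The rotation error in angle-axis form can be sketched as follows. The helper names are hypothetical, the angle-axis map is the standard logarithm of a rotation matrix (valid away from rotation angles near π), and centering the bootstrap errors on their mean before forming the covariance is an assumption of this example.

```python
import numpy as np

def rotation_from_angle_axis(r):
    """Rodrigues formula: rotation matrix for the angle-axis vector r."""
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.eye(3)
    k = r / angle
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def angle_axis(R):
    """Angle-axis vector r = f(R); not reliable for angles close to pi."""
    cos_a = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_a)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if angle < 1e-8:
        return 0.5 * w                            # first-order approximation near identity
    return angle / (2.0 * np.sin(angle)) * w

def rotation_covariance(R_boot, R_hat):
    """Covariance of the errors f(R_b R_hat^T) over the bootstrap rotations R_boot."""
    d = np.array([angle_axis(Rb @ R_hat.T) for Rb in R_boot])
    d = d - d.mean(axis=0)
    return d.T @ d / (len(d) - 1)
```

Composing each bootstrap rotation with the transpose of the reference estimate before taking the angle-axis vector keeps the error small and in the tangent space at the estimate, which is why the covariance is not computed from differences of rotation matrices directly.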

Though confidence regions can be computed separately for $R$ and $t$, a better
joint coverage is obtained by exploiting the existing correlation between the
rotation and translation estimates. Define the motion estimation error $e$ as the
six-dimensional vector

$e = [(\hat t - t)^T \ \ \delta\hat r^T]^T$.    (9)

The confidence region in $\mathbb{R}^6$ for the rigid motion parameters is
constructed using the error terms
$e^{*b} = [(t^{*b} - \hat t)^T \ \ (\delta r^{*b})^T]^T$, see Section 2.

3.1 Experiments with Synthetic Data 

The simulated setting used in our experiments consists of a stereo head moved 
around a fixed scene. The cameras had zero vergence and focal distance $f = 536$
yielding a field of view of 50° on both x and y axes. The baseline of the stereo 
head was 100 and the image planes were 500 x 500, all values being in pixel 
units. The n = 50 three-dimensional points were uniformly generated inside a 
cube with the side length 800 placed at 1300 in front of the cameras. The 3D 
points are projected onto the image planes, corrupted by adding normal noise 
with cr = 1 and then allocated to the nearest lattice site. The 3D information 
is recovered using Kanatani’s triangulation method m pp. 171-186]. For this 
type of triangulation there are closed-form expressions for $C_{v_i}$ and $C_{u_i}$; however,
we preferred the bootstrap computation described in m since is more general, 
being also applicable to any other triangulation method like those presented in 

HIES. 

The bootstrapped covariance matrices Cui inflated to assure an individual 
coverage of 0.95 are represented in Figure 2 for ten data points. The equivalent
error is much higher on the depth and depends on the position of the 3D point 
in space, thus confirming the theoretical results EED. 




Fig. 2. The covariance matrices $C_{u_i}$ extracted by bootstrap for the 3D data points.
Note the heteroscedasticity. 

The analysis of the rigid motion employed B = 200 bootstrap samples generated
as discussed in Section 2. The validity of the bootstrap was verified using a
Monte Carlo analysis based on the true values and the continuous noise process. 
Recall that the bootstrap uses only the available quantized data. 

For both methods the covariance matrices were estimated using B = 200 
samples. Fifty trials, each having the motion parameters randomly generated 
(chosen such that the scene remains in the field of view of the cameras) and 
different 3D point configurations were performed. 

In Figure 3 the translation and rotation errors
$\|\delta t\| = [\mathrm{trace}(\hat C_{\hat t})]^{1/2}$,
$\|\delta r\| = [\mathrm{trace}(\hat C_{\hat r})]^{1/2}$ are plotted for the HEIV
and quaternion algorithms using the bootstrap and the Monte Carlo estimates for
$\hat C_{\hat t}$, $\hat C_{\hat r}$.

Note the very good agreement between the bootstrap and Monte Carlo error
estimates and the larger translation and rotation error yielded by the quaternion
method compared with the HEIV algorithm. The same approach can be used
to compare the performance of other motion estimators; however, the bootstrap
must use the residuals generated by a consistent estimator like HEIV.

Fig. 3. Comparison between the bootstrap (BT) and Monte Carlo (MC) error estimates
for HEIV and the quaternion method, plotted against the index of trials.
(a) Translation, (b) Rotation; 'o' MC estimate for HEIV, 'o' BT estimate for HEIV,
'+' MC estimate for quaternion, 'x' BT estimate for quaternion.



3.2 Experiments with Real Data 

The CIL-CMU database consists of four data sets: castle, planar texture, wall 
& tower and copper tea kettle. Each set consists of 11 frames for which precise 
ground truth information is available. Lack of space prevents us from presenting
all the results and we describe only the experiments using frames 1, 4, 5 and 9
from the planar texture set. 




Fig. 4. Frames one and four from the Planar Textures images from CIL-CMU database. 





Fig. 5. Uncertainty of the translation estimation for the quaternion (left) and the
HEIV (right), evaluated by bootstrap. The ellipsoids assure a 0.95 coverage for the
true translation. Both plots are at the same scale.

Fig. 6. Uncertainty of the rotation estimation for the quaternion (left) and the HEIV
(right), evaluated by bootstrap. The ellipsoids assure a 0.95 coverage for the true
rotation. Both plots are at the same scale.

The 3D information was extracted by triangulation from image points
obtained by the matching program of Z. Zhang. The estimation used 46
measurements. The true translation was $t = [79.957 \ \ {-0.503} \ \ {-1.220}]^T$
and the rotation in angle-axis representation
$r = 0.001 \cdot [0.619 \ \ 0.193 \ \ 0.013]$. The quaternion algorithm estimates
the translation and rotation erroneously, yielding
$\hat t = [-159.001 \ \ 69.867 \ \ 13.050]^T$ and
$\hat r = [0.035 \ \ 0.118 \ \ {-0.013}]^T$, respectively. On the
other hand the HEIV solution is much closer to the true value, being
$\hat t = [56.094 \ \ 6.492 \ \ {-1.392}]^T$ and
$\hat r = [0.004 \ \ 0.012 \ \ {-0.001}]^T$.

The bootstrapped covariance matrices of the estimates $\hat R$, $\hat t$ for the
quaternion and HEIV methods are plotted in Figures 5 and 6. As expected the
quaternion based algorithm has a much larger variability than the HEIV, since it
assumes i.i.d. data.

4 Bootstrapping the Fundamental Matrix

The geometry of a stereo head is captured by the epipolar geometry. In the case
of uncalibrated cameras the relationship between the two sets of matched points
$\{x_{lio}\}$, $\{x_{rio}\}$ in the right, respectively the left image can be
expressed using the fundamental matrix $F$ as

$\alpha = F^{(33)}, \qquad \theta^T = [F^{(31)} \ F^{(32)} \ F^{(13)} \ F^{(23)} \ F^{(11)} \ F^{(21)} \ F^{(12)} \ F^{(22)}]$

The eight-point algorithm introduced by Longuet-Higgins solves the lin- 
earized epipolar constraint 




( 11 ) 

( 12 ) 



where vec(A) denotes the vectorization of the matrix A. The procedure is
equivalent to a Total Least Squares (TLS) solution of a regression model. The
sensitivity of the eight-point algorithm to noise affecting the image points is
partially remedied by normalizing the image points as shown by Hartley [?]. The
normalized eight-point estimate can be subsequently refined using nonlinear criteria
(distance from the noisy image points to the epipolar lines, the Gold Standard,
etc.) and the Levenberg-Marquardt optimization technique. A comprehensive
review of the nonlinear criteria is given in [?].
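The normalized eight-point procedure described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the convention $x_r^\top F x_l = 0$ in homogeneous pixel coordinates, uses the common form of Hartley's normalization (centroid at the origin, mean distance $\sqrt{2}$), and the function names `normalize_points` and `eight_point` are hypothetical.

```python
import numpy as np

def normalize_points(pts):
    """Translate the centroid to the origin and scale to mean distance sqrt(2)."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.mean(np.linalg.norm(pts - centroid, axis=1))
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return (T @ homog.T).T, T

def eight_point(x_left, x_right):
    """Linear estimate of F with x_right^T F x_left = 0 (assumed convention)."""
    xl, Tl = normalize_points(x_left)
    xr, Tr = normalize_points(x_right)
    # Row i is kron(x_right_i, x_left_i), so A @ F.ravel() stacks all constraints.
    A = np.stack([np.kron(r, l) for l, r in zip(xl, xr)])
    # TLS solution: right singular vector of the smallest singular value.
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce the rank-two constraint.
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt
    # Undo the normalization and fix the arbitrary scale.
    F = Tr.T @ F @ Tl
    return F / np.linalg.norm(F)
```

The estimate is defined only up to scale, so the sketch fixes the Frobenius norm to one; any nonlinear refinement would start from this value.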

The uncertainty of the fundamental matrix computation can be characterized
by the covariance matrix of F and the confidence bands for the epipolar lines [?].
In [?] the covariance of F was computed in two ways: using error propagation
and performing Monte Carlo simulations. The former method had the advantage
of yielding closed-form covariance estimates, but the required computations
were quite involved. In the latter technique, B realizations of the fundamental
matrix were computed by adding normal noise to the noisy image points. The
sample covariance matrix was finally determined from these fundamental matrix
realizations. The ad hoc choice of the noise standard deviation and the addition
of noise to image points that are already noisy are the major shortcomings of
this approach. These deficiencies can be overcome by the bootstrap methodology
presented below.
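The Monte Carlo technique summarized above (perturb the measured points B times, re-estimate, take the sample covariance) can be sketched generically. This is an illustration under stated assumptions: `estimator` stands for any routine that maps points to a parameter array, and the name `monte_carlo_covariance` and its arguments are hypothetical.

```python
import numpy as np

def monte_carlo_covariance(points, estimator, sigma, B=200, seed=0):
    """Sample covariance of estimator outputs over B noisy copies of `points`.

    points    : (n, d) array of measured (already noisy) coordinates
    estimator : callable mapping an (n, d) array to a parameter array
    sigma     : standard deviation of the added normal noise (an ad hoc choice)
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    realizations = []
    for _ in range(B):
        noisy = points + rng.normal(0.0, sigma, size=points.shape)
        realizations.append(np.ravel(estimator(noisy)))
    realizations = np.array(realizations)      # shape (B, number of parameters)
    return np.covariance if False else np.cov(realizations, rowvar=False)
```

The sketch makes the two shortcomings visible: `sigma` must be chosen by hand, and the noise is added on top of measurements that are themselves noisy.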

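The HEIV algorithm can be used to solve (12) and to extract the statistical
information required by the bootstrap [?]. Assuming for simplicity that the
image points $x_{io}$ are corrupted by zero-mean, i.i.d. noise
$\delta x_i \sim GI(0, \sigma^2 I_4)$, the carriers $Z_{io}$ are corrupted by
heteroscedastic noise

$$\delta Z_i \sim GI(0, \sigma^2 C_{Z_i}), \qquad C_{Z_i} = \left[\frac{\partial Z_i}{\partial x_i}\right] \left[\frac{\partial Z_i}{\partial x_i}\right]^\top \qquad (13)$$

with $GI(\mu, C)$ standing for a general and independent distribution with mean
$\mu$ and covariance $C$. The HEIV estimates the noise variance $\hat{\sigma}^2$
in (A.9) and the fundamental matrix $\hat{F}$ from $\hat{\theta}$ and $\hat{\alpha}$.
It can be shown (the proof is beyond the scope of the paper) that the HEIV
corrected image points $\hat{x}_i$ obeying $\phi(\hat{x}_i, \hat{F}) = 0$ are
obtained by iteratively solving the equation

$$\hat{x}_i = x_i - \left[\frac{\partial Z_i}{\partial x_i}\right]^\top \hat{\Theta}\, S_i^{-1}\, \phi(x_i, \hat{F}) \qquad (14)$$

where $\hat{\Theta}$ and $S_i$ are defined in (A.3). The rank-one covariance of
the residuals $\delta\hat{x}_i = x_i - \hat{x}_i$ is

$$C_{\delta\hat{x}_i} = \left[\frac{\partial Z_i}{\partial x_i}\right]^\top \hat{\Theta}\, S_i^{-1}\, \hat{\Theta}^\top \left[\frac{\partial Z_i}{\partial x_i}\right] \qquad (15)$$

With the residuals $\delta\hat{x}_i$ and their covariance matrices (15) available,
the bootstrap can be applied as shown in Section [?].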



4.1 Experiments with Synthetic Data 

A synthetic camera with the geometry depicted in Figure 7 is used in the following
simulations. The image plane is 500 x 500 pixels and the focal distance is f = 117
pixels, corresponding to a field of view of 130°. The 3D points
are generated in a cube with side equal to 2000 and with the center placed in
front of the cameras at Z = 1500. Normal noise with zero mean and σ = 3
was added to n = 50 matched image points. To illustrate the potential of the
proposed technique we applied it to a well-known method for recovering the
fundamental matrix, the normalized eight-point algorithm.
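The numbers quoted for the synthetic setup can be checked with a short sketch. It covers a single pinhole camera only, since the stereo geometry of Figure 7 is not recoverable from the text; the principal point at the image center, the uniform sampling inside the cube, and all function names are assumptions made for illustration.

```python
import math
import random

IMAGE_SIZE = 500.0      # the image plane is 500 x 500 pixels
FOCAL = 117.0           # focal distance in pixels
CUBE_SIDE = 2000.0      # side of the cube containing the 3D points
CUBE_CENTER_Z = 1500.0  # cube center, in front of the camera
SIGMA = 3.0             # standard deviation of the image noise, in pixels
N_POINTS = 50

def field_of_view_deg(image_size, focal):
    """Full angle subtended by the image width at the optical center."""
    return math.degrees(2.0 * math.atan(0.5 * image_size / focal))

def project(point, focal=FOCAL, image_size=IMAGE_SIZE):
    """Pinhole projection with the principal point at the image center (assumed)."""
    x, y, z = point
    return (focal * x / z + 0.5 * image_size, focal * y / z + 0.5 * image_size)

def synthetic_points(n=N_POINTS, seed=0):
    """Draw n points uniformly inside the cube (the sampling law is assumed)."""
    rng = random.Random(seed)
    half = 0.5 * CUBE_SIDE
    return [(rng.uniform(-half, half), rng.uniform(-half, half),
             CUBE_CENTER_Z + rng.uniform(-half, half)) for _ in range(n)]

def noisy_projections(points, sigma=SIGMA, seed=1):
    """Project the points and add zero-mean normal noise to each coordinate."""
    rng = random.Random(seed)
    return [(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma))
            for u, v in map(project, points)]

# Close to the 130 degrees quoted in the text.
fov = field_of_view_deg(IMAGE_SIZE, FOCAL)
```

With such a wide field of view, points near the cube faces can project outside the 500 x 500 image; how the original simulation handled them is not stated.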

Statistical measures for the normalized eight-point estimates are extracted 
with bootstrap and Monte Carlo simulations for comparison. The HEIV algo- 
rithm is used to recover the residuals and their covariance matrices, as described 




Fig. 7. The stereo head used in the simulations and the points in the left image. 




Fig. 8. Histogram of the x (left) and y (right) coordinates of the left epipole estimated 
with the normalized eight-point algorithm from data obtained through bootstrap and 
Monte Carlo. 



(a) Bootstrap Confidence Band    (b) Monte Carlo Confidence Band



Fig. 9. Confidence bands for the epipolar lines in the right image created with boot- 
strap (a) and Monte Carlo simulations (b). 
above. In Figure 8 the histograms of B = 400 bootstrap and Monte Carlo samples
of the left epipole estimates are shown. Note the good approximation of the
true distribution (Monte Carlo) obtained through bootstrap, which used only
one set of data points.

Most often the uncertainty in the fundamental matrix estimation is represented
through the confidence bands of the epipolar lines [?]. In Figure 9 such
confidence bands for four epipolar lines in the right image are plotted using the
bootstrap and Monte Carlo estimates for the covariance of F. Note the slight
difference between the two plots. The bootstrap samples are distributed around
the estimate $\hat{F}$, the only information available from the image pair. The
Monte Carlo simulation, on the other hand, is distributed around the true F.

5 Conclusion 

We have exploited the bootstrap paradigm for the analysis of heteroscedastic
errors-in-variables models. We have shown that through this method valuable
information can be recovered solely from the data, which can be integrated into
the subsequent processing modules.
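A HEIV Regression

A short presentation of the HEIV algorithm applied to multivariate regression
is given next. Assume that the true values $z_{io}^{(k)} \in \mathbb{R}^p$,
$i = 1, \dots, n$, $k = 1, \dots, m$, satisfy the multivariate linear model

$$\alpha + Z_{io}\,\theta = 0, \qquad \theta \in \mathbb{R}^p, \qquad Z_{io} = \left[ z_{io}^{(1)}\ \cdots\ z_{io}^{(m)} \right]^\top .$$

The true values are corrupted by heteroscedastic noise $\delta z_i^{(k)}$,
$z_i^{(k)} = z_{io}^{(k)} + \delta z_i^{(k)}$, with

$$E\left[\delta z_i^{(k)}\right] = 0, \qquad \mathrm{cov}\left[\delta z_i^{(k)}, \delta z_i^{(l)}\right] = \sigma^2 C_i^{(kl)} .$$

Define

$$w_{io} = \mathrm{vec}(Z_{io}^\top), \qquad \delta w_i = \mathrm{vec}(\delta Z_i^\top), \qquad C_i = \left[ C_i^{(kl)} \right] .$$

The estimates $\hat{\theta}$, $\hat{\alpha}$ are determined by minimizing

$$J = \frac{1}{2} \sum_{i=1}^{n} (w_i - w_{io})^\top C_i^{+} (w_i - w_{io}) \qquad (A.1)$$

subject to the constraint $\alpha + Z_{io}\,\theta = 0$.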
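In practice $z_{io}^{(k)}$ are unavailable. Let $\hat{z}_i^{(k)}$ be the corrected
measurement corresponding to $z_i^{(k)}$. Using the Lagrange multipliers
$\eta_i \in \mathbb{R}^m$, (A.1) becomes

$$J = \frac{1}{2} \sum_{i=1}^{n} (w_i - \hat{w}_i)^\top C_i^{+} (w_i - \hat{w}_i) + \sum_{i=1}^{n} \eta_i^\top (\hat{Z}_i \theta + \alpha) . \qquad (A.2)$$

Define the $mp \times m$ matrix $\Theta$ and the $m \times m$ matrix $S_i$

$$\Theta = I_m \otimes \theta, \qquad S_i = \Theta^\top C_i \Theta, \qquad (A.3)$$

where $\otimes$ is the Kronecker product between two matrices. The estimates
$\hat{\theta}$, $\hat{\alpha}$ and $\hat{z}_i^{(k)}$ are found by solving the system of equations

$$\nabla_{\theta} J = 0, \qquad \nabla_{\alpha} J = 0, \qquad \nabla_{\hat{z}_i^{(k)}} J = 0 .$$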



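After some algebra the estimates $\hat{\theta}$ and $\hat{\alpha}$ are obtained by
iteratively solving a generalized eigenvalue problem.

1. Compute an initial solution $\hat{\theta}^{(0)}$, for example the Total Least Squares (TLS)
estimate obtained assuming i.i.d. noise. A random initial value, however,
sufficed to achieve satisfactory convergence in a large range of problems.

2. Compute the matrices $S_i$, $i = 1, \dots, n$ using (A.3).

3. Compute the weighted "centroid" matrix $\bar{Z}$

$$\bar{Z} = \left[ \sum_{i=1}^{n} S_i^{-1} \right]^{-1} \sum_{i=1}^{n} S_i^{-1} Z_i . \qquad (A.4)$$

4. Compute the scatter $S(\hat{\theta}^{(j)})$ relative to $\bar{Z}$

$$S(\hat{\theta}^{(j)}) = \sum_{i=1}^{n} (Z_i - \bar{Z})^\top S_i^{-1} (Z_i - \bar{Z}) \qquad (A.5)$$

and the weighted covariance matrix

$$C(\hat{\theta}^{(j)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} \sum_{l=1}^{m} \eta_i^{(k)} \eta_i^{(l)} C_i^{(kl)} , \qquad (A.6)$$

where the Lagrange multipliers are $\eta_i = S_i^{-1} (Z_i - \bar{Z})\, \hat{\theta}^{(j)}$.
From (A.5), (A.6), $S(\hat{\theta}^{(j)})$ and $C(\hat{\theta}^{(j)})$ are positive semi-definite.

5. The estimate $\hat{\theta}^{(j+1)}$ is the eigenvector corresponding to the smallest
eigenvalue in the generalized eigenproblem

$$S(\hat{\theta}^{(j)})\, \hat{\theta}^{(j+1)} = \lambda\, C(\hat{\theta}^{(j)})\, \hat{\theta}^{(j+1)} . \qquad (A.7)$$

6. Iterate through Steps 2 to 5 until $\lambda$ becomes one (up to a tolerance).
Convergence is achieved after three or four iterations. Let $\hat{\theta}$ be the final estimate.

7. Compute the intercept $\hat{\alpha} = -\bar{Z}\hat{\theta}$.

8. Compute the corrected measurements

$$\hat{z}_i^{(k)} = z_i^{(k)} - \sum_{l=1}^{m} \eta_i^{(l)} C_i^{(kl)} \hat{\theta} . \qquad (A.8)$$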
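The eight steps above can be sketched for the simplest case m = 1, where each $S_i$ is a scalar. This is an illustrative sketch and not the authors' code: it assumes every covariance $C_i$ is positive definite so that the generalized eigenproblem can be reduced with a Cholesky factor, it needs noisy data (with exact data the weighted covariance vanishes), and the name `heiv_univariate` is hypothetical.

```python
import numpy as np

def heiv_univariate(z, C, max_iter=20, tol=1e-10):
    """HEIV sketch for m = 1: fit alpha + z_i^T theta = 0 with ||theta|| = 1.

    z : (n, p) measured carrier vectors
    C : (n, p, p) carrier covariances, known up to a common scale and
        assumed positive definite in this sketch
    """
    z = np.asarray(z, dtype=float)
    C = np.asarray(C, dtype=float)
    n, p = z.shape

    def weights_and_centroid(theta):
        s = np.einsum('j,ijk,k->i', theta, C, theta)    # S_i, scalars when m = 1
        w = 1.0 / s
        z_bar = (w[:, None] * z).sum(axis=0) / w.sum()  # weighted centroid
        return w, z_bar

    # Step 1: TLS-like start, the direction of least scatter of the centered data.
    theta = np.linalg.svd(z - z.mean(axis=0))[2][-1]
    for _ in range(max_iter):
        w, z_bar = weights_and_centroid(theta)          # Steps 2 and 3
        d = z - z_bar
        S = np.einsum('i,ij,ik->jk', w, d, d)           # scatter matrix
        eta = w * (d @ theta)                           # Lagrange multipliers
        Cw = np.einsum('i,ijk->jk', eta ** 2, C)        # weighted covariance
        # Step 5: smallest eigenpair of S x = lambda Cw x via a Cholesky factor of Cw.
        L_inv = np.linalg.inv(np.linalg.cholesky(Cw))
        vals, vecs = np.linalg.eigh(L_inv @ S @ L_inv.T)
        theta = L_inv.T @ vecs[:, 0]
        theta /= np.linalg.norm(theta)
        if abs(vals[0] - 1.0) < tol:                    # Step 6: lambda close to one
            break
    w, z_bar = weights_and_centroid(theta)
    alpha = -z_bar @ theta                              # Step 7
    # Noise variance estimate for m = 1: theta^T S theta / (n - p + 1).
    sigma2 = float(np.sum(w * ((z - z_bar) @ theta) ** 2)) / (n - p + 1)
    return theta, alpha, sigma2
```

Because the sign of an eigenvector is arbitrary, `theta` and `alpha` are recovered up to a common sign.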



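The matrices $C_i$ need to be known only up to a positive multiplicative constant,
the equivalent noise variance $\sigma^2$, which can now be estimated as

$$\hat{\sigma}^2 = \frac{\hat{\theta}^\top S(\hat{\theta})\, \hat{\theta}}{mn - p + 1} . \qquad (A.9)$$

The a posteriori covariance of $\hat{w}_i = \mathrm{vec}(\hat{Z}_i^\top)$ is

$$C_{\hat{w}_i} = C_i - C_{\delta\hat{w}_i} = C_i - C_i \hat{\Theta} S_i^{-1} \hat{\Theta}^\top C_i . \qquad (A.10)$$

Note that $\mathrm{rank}(C_{\hat{w}_i}) = \mathrm{rank}(C_i) - m$.

The estimates $\hat{\theta}$, $\hat{\alpha}$, $\hat{z}_i^{(k)}$ are consistent, i.e. they converge to the
true values as the number of measurements increases. Using Taylor expansions of
eigenvectors, a first order approximation of the covariance of the estimates
$\hat{\theta}$, $\hat{\alpha}$ is

$$C_{\hat{\theta}} = \hat{\sigma}^2 \left[ S(\hat{\theta}) - C(\hat{\theta}) \right]^{+}, \qquad C_{\hat{\alpha}} = \bar{Z}\, C_{\hat{\theta}}\, \bar{Z}^\top . \qquad (A.11)$$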



B Analysis of the Data Correction for 3D Rigid Motion 



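Given the estimated motion parameters $\hat{R}$ and $\hat{t}$ we are interested in finding
the projections $\hat{v}_i$, $\hat{u}_i$ of $v_i$ and $u_i$ onto the three-dimensional manifold of the
solution in $\mathbb{R}^6$ defined by $\hat{v}_i = \hat{R}\hat{u}_i + \hat{t}$. To simplify the notation in the sequel
the measurement index $i$ is dropped. Let

$$A = [\, I_3 \ \ {-\hat{R}} \,], \qquad w = \begin{bmatrix} v - \hat{t} \\ u \end{bmatrix}, \qquad \hat{w} = \begin{bmatrix} \hat{v} - \hat{t} \\ \hat{u} \end{bmatrix} .$$

Thus the solution $\hat{w}$ should satisfy $A\hat{w} = 0$, i.e. it must be in the three-dimensional
null space of $A$. In other words, $\hat{w}$ must be in the range space of a rank 3 matrix
$B$, chosen such that $AB = 0$. A natural choice is $B = [\, \hat{R}^\top \ \ I_3 \,]^\top$. Then the
projection of $w$ onto the space spanned by the columns of $B$ under the metric
given by the covariance matrix of $w$, $C_w$, is [21, p. 386]

$$\hat{w} = B (B^\top C_w^{-1} B)^{-1} B^\top C_w^{-1} w = P w . \qquad (B.1)$$

The measurements in the two sets of 3D points are uncorrelated, therefore

$$C_w = \begin{bmatrix} C_v & 0 \\ 0 & C_u \end{bmatrix} .$$

The projection matrix $P$ can be expressed after some algebra as $P = [P_{ij}]$,

$$P_{11} = \left[ I_3 + C_v \hat{R}\, C_u^{-1} \hat{R}^\top \right]^{-1}, \qquad P_{12} = \hat{R} P_{22} \qquad (B.2)$$

$$P_{22} = \left[ I_3 + C_u \hat{R}^\top C_v^{-1} \hat{R} \right]^{-1}, \qquad P_{21} = \hat{R}^\top P_{11} \qquad (B.3)$$

When the estimates $\hat{R}$ and $\hat{t}$ are close to the true values $R$ and $t$, the weighted
least squares projection (B.1) implies that the estimator $\hat{z}$ of $z_o$ can be considered
unbiased, and the residuals $\delta\hat{z} = z - \hat{z}$ to have zero mean,

$$E[\hat{z}] = z_o, \qquad E[\delta\hat{z}] = E[z] - E[\hat{z}] = 0 . \qquad (B.4)$$

Using (B.1) the covariance matrices of $\hat{z}$ and the residuals $\delta\hat{z}$ are

$$C_{\hat{z}} = P C_z P^\top, \qquad C_{\delta\hat{z}} = [I_6 - P]\, C_z\, [I_6 - P]^\top . \qquad (B.5)$$

The covariance of the residuals depends only on the rotation and has rank three.
From (B.2)-(B.3) and (B.5) the covariances for $\delta\hat{v}$ and $\delta\hat{u}$ are

$$C_{\delta\hat{v}} = [I_3 - P_{11}] \left[ \hat{R} C_u \hat{R}^\top + C_v \right] [I_3 - P_{11}]^\top \qquad (B.6)$$

$$C_{\delta\hat{u}} = [I_3 - P_{22}] \left[ \hat{R}^\top C_v \hat{R} + C_u \right] [I_3 - P_{22}]^\top \qquad (B.7)$$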
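The block expressions of this appendix can be checked numerically against the general projection formula (B.1). The sketch below draws an arbitrary rotation and two arbitrary positive definite covariances; the helper names `projection_blocks` and `random_spd`, the random seed, and the test values are all assumptions made for illustration.

```python
import numpy as np

def projection_blocks(R, C_v, C_u):
    """Projection P = B (B^T C^-1 B)^-1 B^T C^-1 with B = [R; I3]."""
    B = np.vstack([R, np.eye(3)])                 # A @ B = 0 for A = [I3, -R]
    zero = np.zeros((3, 3))
    C_w = np.block([[C_v, zero], [zero, C_u]])    # uncorrelated point sets
    C_w_inv = np.linalg.inv(C_w)
    P = B @ np.linalg.inv(B.T @ C_w_inv @ B) @ B.T @ C_w_inv
    return P, C_w

def random_spd(rng):
    """Random symmetric positive definite 3 x 3 matrix."""
    M = rng.normal(size=(3, 3))
    return M @ M.T + 0.1 * np.eye(3)

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q             # proper rotation, det = +1
C_v, C_u = random_spd(rng), random_spd(rng)

P, C_w = projection_blocks(R, C_v, C_u)
P11, P12, P21, P22 = P[:3, :3], P[:3, 3:], P[3:, :3], P[3:, 3:]
I3, I6 = np.eye(3), np.eye(6)

# Closed forms of the diagonal blocks.
P11_closed = np.linalg.inv(I3 + C_v @ R @ np.linalg.inv(C_u) @ R.T)
P22_closed = np.linalg.inv(I3 + C_u @ R.T @ np.linalg.inv(C_v) @ R)

# Residual covariance and its two 3 x 3 diagonal blocks.
C_res = (I6 - P) @ C_w @ (I6 - P).T
C_dv = (I3 - P11) @ (R @ C_u @ R.T + C_v) @ (I3 - P11).T
C_du = (I3 - P22) @ (R.T @ C_v @ R + C_u) @ (I3 - P22).T
```

Comparing `P11` with `P11_closed`, `P12` with `R @ P22`, and the diagonal blocks of `C_res` with `C_dv` and `C_du` (for example with `np.allclose`) confirms the block formulas for this draw.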
Discussion

Bill Triggs: Two things. One is just a minor grumble. You say in the earlier 
part of your talk that optimization, Levenberg-Marquardt, is slow, so you give 
another method. But your method does a very similar update calculation to 
Levenberg-Marquardt using a slower algorithm — eigendecomposition rather 
than linear solution. 

Bogdan Matei: The updates are fairly similar. However, the algorithm is designed
for errors-in-variables models which are not handled very well by optimization
techniques like Levenberg-Marquardt. That's why Levenberg-Marquardt
usually needs more iterations to converge.

Bill Triggs: Secondly, with these bootstrap-type methods you're very much
a prisoner of the number of samples you have. In statistics, if you try to fit
covariances to a high-dimensional model, they're often unstable because you
simply don't have enough data to estimate O(n²) covariance parameters. So
despite its limitations, an analytical approach is in some sense more informative.

Bogdan Matei: Using analytical covariances in performance evaluation would
make us the prisoners of elliptical confidence regions and local approximations.
With sparse data in a high dimensional space, even a theoretical approach to
computing the covariance would experience the same problems in the absence of
ground truth information.
Abstract. Video as an entertainment or information source in consumer, military,
and broadcast television applications is widespread. Typically, however, the video
is simply presented to the viewer, with only minimal manipulation. Examples
include chroma-keying (often used in news and weather broadcasts) where specific
color components are detected and used to control the video source. In the past
few years, the advent of digital video and increases in computational power have
meant that more complex manipulation can be performed. In this paper we present
some highlights of our work in annotating video by aligning features extracted
from the video to a reference set of features.

Video insertion and annotation require manipulation of the video stream to com- 
posite synthetic imagery and information with real video imagery. The manip- 
ulation may involve only the 2D image space or the 3D scene space. The key 
problems to be solved are: (i) indexing and matching to determine the location
of insertion, (ii) stable and jitter-free tracking to compute the time variation of 
the camera, and (iii) seamlessly blended insertion for an authentic viewing expe- 
rience. We highlight our approach to these problems by showing three example 
scenarios: (i) 2D synthetic pattern insertion in live video, (ii) annotation of aerial 
imagery through geo-registration with stored reference imagery and annotations, 
and (iii) 3D object insertion in a video for a 3D scene. 



1 Introduction 

The ability to manipulate video in digital form has opened the potential for numerous 
applications that may have been difficult to implement with a purely analog representa- 
tion of video. With the representation of video in terms of its underlying fundamental 
components of geometry, temporal transformation and appearance of patterns, any or all 
of these can be manipulated to modify the video stream. Seamless insertion of 2D and 
3D objects, and textual and graphical annotations are forms of video manipulation with 
wide-ranging applications in the commercial, consumer and government worlds. In this 
paper we present some highlights of our work on annotation and insertion of synthetic 
objects and information into video and digital imagery. 

Video insertion and annotation require manipulation of the video stream to composite 
synthetic imagery and information with real video imagery. There are a number of 
dimensions of this problem. First, the insertion may either demand 2D representation 
of the video and the camera motion, or 3D objects may need to be composited in 3D 
scenes with arbitrary camera motion. Second, the insertion and manipulation may rely 
on a small and fixed collection of landmarks in the scene or there may be a large database 
of stored reference imagery or nothing may be known a priori about the scene. 
In the first scenario, we address the problem of inserting 2D patterns into broadcast
video with an essentially fixed pan/tilt/zoom camera. The technology has matured
into products that are currently being used on a regular basis for insertion of virtual
billboards and game-related synthetic annotations in broadcast sports videos [?].
The second scenario is that of accurately locating current videos of a locale in a stored
reference image database and subsequently visualizing the current video stream and the
annotations in the database as footprints registered with the database imagery. We are
currently building a real-time system for this capability for geo-registration of aerial
videos to a reference database [?]. In this scenario, the transformations that relate the
video to the database may be 2D or 3D. The final scenario is of a 3D scene imaged with
an arbitrarily moving camera and the goal is to be able to insert synthetic 3D objects
into the real imagery. No a priori knowledge of the 3D scene is assumed.

The underlying technical problems that need to be solved for the above scenarios are:
(i) indexing and matching the video to precisely locate the video frame, (ii) stable
and jitter-free camera pose estimation in 2D and/or 3D, and (iii) seamless insertion of the
synthetic pattern and objects for a visually pleasing experience. In this paper we present
highlights of our solutions to the technical problems and demonstrate the validity of our
approach through visual results.



2 2D Video Insertion 



The basic approach for 2D video insertion is in three steps: training, coarse indexing,
and fine alignment [?]. The training step is performed in non-real-time in a set-up
phase, and the coarse indexing and fine alignment steps are performed in real-time using
a hardware system.

In the training step, imagery that will be used as reference for the coarse indexing
and fine alignment steps is captured. The target region can be occluded or can enter or
leave the field of view when insertion is being performed, and therefore the approach
is to record imagery not just of the target region itself, but also of surrounding regions.
The images are then aligned to each other so that the location of the target region can be
inferred from the recovered locations of surrounding regions.

In the coarse indexing step, regions that have been identified in the reference imagery
are located in the current imagery using a hierarchical pattern tree search [?]. The pattern
tree comprises a set of templates from the reference imagery and their relative spatial
transformation in the coordinate system of the reference imagery. Coarse search begins
by correlating the coarsest resolution templates across the current video image. If a
potential match is found, then the next template in the pattern tree is correlated with the
image using the relative spatial transformation in the pattern tree to compute the location
around which correlation should occur. This process is repeated for more templates in
the pattern tree, and the model relating the current image to the pattern tree is refined
as successive potential matches are found. A successful search is declared if a sufficient
number of templates are matched in the pattern tree.

The fine alignment step recovers the precise transformation between the current
image and the reference imagery. The current image is first shifted or warped to the
reference imagery using the model provided by the coarse search result. The model
parameters are then refined using an alignment method described in [?].

Once the precise alignment between the reference and current imagery has been
determined, graphics in the coordinate system of the reference imagery can be warped
to the coordinate system of the current imagery and inserted.

Figure 1 shows a simple insertion example. The image on the left shows an original
frame from a sequence, and the image on the right shows a manipulated frame with a
new logo superimposed on top of the left hand box.

Fig. 1. Left: Original image from a sequence. Right: The image after manipulation.

3 Geo-registration 

Aerial video is rapidly emerging as a low cost, widely used source of imagery for
mapping, surveillance and monitoring applications. The mapping between camera
coordinates in the air and the ground coordinates, called geospatial registration, depends
both on the location and orientation of the camera and on the distance and topology of
the ground. Rough geospatial registration can be derived from the ESD (Engineering
Support Data, obtained from the GPS and inertial navigation units etc.) stream
provided by the airborne camera telemetry system and digital terrain map data (from a
database). This form of registration is the best available in fielded systems today but
does not provide the precision needed for many tasks. Higher precision will be achieved
by correlating (and registering) observed video frames to stored reference imagery.
Applications of precise geospatial registration include the overlay of maps, boundaries
and other graphical features and annotations onto the video imagery.

We present the details and results of some of the key algorithms we have developed 
in our laboratory towards implementing the overall system for geo-spatial registration. 
This work extends the previous body of work based on still imagery exploitation using
site models [?]. We include more recent work on developing a real-time geo-registration
system.

3.1 Our Approach 

A frame-to-frame alignment module first computes the spatial transformation between
successive images in the video stream. These results are used in both the coarse indexing
and fine geo-registration steps as described in the following sections.







KJ. Hanna et al. 



The engineering support data (ESD: GPS, camera look angle, etc.) supplied with
the video is decoded to define the initial estimate of the camera model (position and 
attitude) with respect to the reference database. The camera model is used to apply an 
image perspective transformation to reference imagery obtained from the database to 
create a set of synthetic reference images from the perspective of the sensor which are 
used for coarse search and fine geo-registration. 

A coarse indexing module then locates the video imagery more precisely in the 
reference image. An individual video frame may not contain sufficient information to 
perform robust matching and therefore results are combined across multiple frames using 
the results of frame to frame alignment. 

A fine geo-registration module then refines this estimate further using the relative 
information between frames to constrain the solution. 



3.2 Frame to Frame Alignment 

Video frames are typically acquired at 30 frames a second and contain a lot of frame- 
to-frame overlap. For typical altitudes and speeds of airborne platforms, the overlaps 
may range from 4/5 to 49/50th of a single frame. We exploit this overlap by converting 
a redundant video stream into a compact image stream comprising key frames and 
parametric models that relate the key frames. For instance, typically 30 frames in a
second of standard NTSC resolution (720x480) video containing about 10M pixels
may be reduced to a single mosaic image containing only about 200K to 2M pixels
depending on the overlap between successive frames. The successive video frames are
aligned with low order parametric transformations like translation, affine and projective
transformations [?].

3.3 Coarse Indexing/Matching 

We present a solution to the coarse matching problem where geometric changes and 
poor matching of features are handled by combining local appearance matching with 
global consistency. 

In our current real-time implementation, local appearance matching is performed 
using normalized correlation of multiple image patches in the image. These individual 
correlation surfaces often have multiple peaks. Disambiguation is obtained by impos- 
ing global consistency by combining the frame-to-frame motion information with the 
correlation surfaces. Specifically, we are looking for a number of potentially poor local 
matches that exhibit global consistency as the UAV flies along. This is currently imple- 
mented by multiplying the correlation surfaces after they have been warped or shifted 
by the frame-to-frame motion parameters. Figure 2 shows an example of this process.
The top row shows three images from UAV video. The middle row shows the results of 
correlating regions from the reference imagery to the current imagery. Note that there 
is no single clear peak in the correlation surfaces. The bottom image shows the results 
of multiplying the correlation surfaces together after warping by a transform computed 
from the frame to frame parameters. Note that there is now a single peak in the correlation 
surface. 
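The disambiguation step described above can be illustrated with a toy sketch. It assumes the frame-to-frame motion is a pure integer translation, which `np.roll` applies with wrap-around at the borders; the real system warps with a parametric transform. The function name `combine_correlation_surfaces` and the toy surfaces are hypothetical.

```python
import numpy as np

def combine_correlation_surfaces(surfaces, shifts):
    """Multiply correlation surfaces after aligning them to a common frame.

    surfaces : list of 2D arrays of non-negative correlation scores
    shifts   : list of integer (dy, dx) translations that bring each surface
               into the coordinate frame of the first one
    """
    combined = np.ones_like(surfaces[0], dtype=float)
    for surface, (dy, dx) in zip(surfaces, shifts):
        combined *= np.roll(surface, shift=(dy, dx), axis=(0, 1))
    return combined

def toy_surface(peaks, shape=(20, 20), floor=0.1):
    """Flat low-score surface with equally strong peaks at the given cells."""
    s = np.full(shape, floor)
    for p in peaks:
        s[p] = 1.0
    return s

# Each surface has two equally strong peaks, so none is unambiguous alone.
shifts = [(0, 0), (2, 1), (4, 2)]
surfaces = [toy_surface([(5, 7), (12, 3)]),
            toy_surface([(3, 6), (15, 15)]),
            toy_surface([(1, 5), (9, 9)])]
combined = combine_correlation_surfaces(surfaces, shifts)
peak = np.unravel_index(np.argmax(combined), combined.shape)
```

Only the location that is consistent with the frame-to-frame motion in all three surfaces keeps a high score after multiplication; the spurious peaks are suppressed by the low scores of the other surfaces.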
Fig. 2. Top Row: Original key frames. Middle Row: Results of correlating an image portion with
the reference imagery. Bottom: Results of multiplying the correlation surfaces after warping using
the frame to frame results.



This scheme does not specifically address the occlusion or drastic change of a region,
but this is somewhat mitigated by repeating the process on different patches of the image
and by selecting the best result as the candidate match.

3.4 Fine Geo-registration 

The coarse localization is used to initialize the process of fine alignment. We now present
the equations used for fine alignment of video imagery to a co-registered reference
image and depth image. The formulation used is the plane+parallax model developed
by [?]. The coordinates of a point in a video image are denoted by $(x, y)$. The
coordinates of the corresponding point in the reference image are given by $(X_r, Y_r)$.
Each point in the reference image has a parallax value $k$. The parallax value is computed
from a digital elevation map (DEM) which is co-registered with the reference image.
Twelve parameters $a_1, \dots, a_{12}$ are used to specify the alignment. The reference image
coordinates $(X_r, Y_r)$ are mapped to the ideal video coordinates $(X_I, Y_I)$ by the following
equations:

$$X_I = \frac{a_1 X_r + a_2 Y_r + a_3\, k(X_r, Y_r) + a_{10}}{a_7 X_r + a_8 Y_r + a_9\, k(X_r, Y_r) + a_{12}} \qquad (1)$$

$$Y_I = \frac{a_4 X_r + a_5 Y_r + a_6\, k(X_r, Y_r) + a_{11}}{a_7 X_r + a_8 Y_r + a_9\, k(X_r, Y_r) + a_{12}} \qquad (2)$$



Note that, since the right-hand side of each of the above two equations is a ratio of two
expressions, the parameters $a_1, \dots, a_{12}$ can only be determined up to a scale factor. We
typically set $a_{12} = 1$ and solve for the remaining 11 parameters. In the case of
the reference image being an orthophoto with a corresponding DEM (digital elevation
map), the parallax value $k$ at a location is equal to the DEM value at that location. When
the reference image is a real image taken from a frame camera (where the imaging is
modeled with perspective projection) the parallax value $k$ at any reference location is
calculated from the depth $z$ at that location using the following equation [?]:

$$k = \frac{(z - \bar{z})\, \bar{z}}{z\, \sigma_z} \qquad (3)$$

where $\bar{z}$ and $\sigma_z$ are the average and standard deviation of the depth image values.
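Equations (1) to (3) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the parameters are passed as a 12-element sequence in the order $a_1, \dots, a_{12}$, and the depth statistics are supplied by the caller.

```python
def parallax_from_depth(z, z_mean, z_std):
    """Equation (3): parallax from depth, given the depth mean and deviation."""
    return (z - z_mean) * z_mean / (z * z_std)

def reference_to_video(a, X_r, Y_r, k):
    """Equations (1)-(2): map reference coordinates to ideal video coordinates.

    a : sequence (a1, ..., a12); only the ratios matter, so a12 is usually 1
    k : parallax value at (X_r, Y_r)
    """
    a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12 = a
    denom = a7 * X_r + a8 * Y_r + a9 * k + a12
    X_I = (a1 * X_r + a2 * Y_r + a3 * k + a10) / denom
    Y_I = (a4 * X_r + a5 * Y_r + a6 * k + a11) / denom
    return X_I, Y_I

# Identity alignment: no rotation, translation, or parallax dependence.
identity = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)
```

Multiplying all twelve parameters by the same nonzero factor leaves the mapping unchanged, which is why one parameter can be fixed.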



Parameter Estimation The reference imagery and the current imagery may be
significantly different, and each individual current image may not contain a significant
number of image features. Therefore it is not particularly robust to solve for the camera
parameters that map the current image to the reference from a single frame. Instead the
approach is to use the results from the frame-to-frame processing to constrain the
simultaneous solving of several sets of frame-to-reference parameters. Once a refined set of
parameters has been recovered, local image matches between the current and reference
imagery are recovered. These matches are then used to refine the sets of parameters as
described below.

The output of the coarse search step is an approximate transform between a current
image and the reference imagery. The output of the frame-to-frame processing is the
transform between successive current images. These are fed into a global minimization
algorithm [12] that minimizes the error between virtual points that are created using the
frame-to-frame parameters and the coarse search parameters to recover an estimate of
the transform between each of the current images and the reference imagery.

The next step is to use image matches between the current and reference imagery at
each frame to improve accuracy. This is performed by computing point correspondences
using a hierarchical flow algorithm [?]. This algorithm assumes brightness constancy,
but the images are pre-filtered with Laplacian filters in order to reduce the impact of
illumination changes. The algorithm computes local matches at every image pixel first
at a coarse resolution, then refines them at finer resolutions.

These point matches are then sampled and, together with the frame-to-frame
parameters, are fed into the global minimization algorithm once more [12]. The result is
a set of refined parameters for each frame. These parameters are then used to warp or
shift the reference imagery to each current image. Point matches between the warped
reference imagery and the current imagery can be computed again and used to refine the
model parameters even further.

Figure 3 shows a geo-mosaic consisting of current imagery that has been warped and
overlaid on top of the reference imagery. Note that the features in the overlaid mosaic
line up with features in the reference imagery.




Fig. 3. A geo-mosaic warped and overlaid on the reference imagery.



Figure 4 shows a frame from the output of a real-time geo-registration system that is
being built at Sarnoff that uses a VFE-200 processor for frame-to-frame alignment and
for coarse search, and an SGI computer for fine alignment. The current imagery is shown
in the center of the figure, and annotations and the reference imagery in the background
have been aligned to it. Remaining misalignments are due to the use of a distortion-free
camera model in the current development phase of the real-time system.

4 3D Match Move 

In geo-registration we assumed that reference models and reference imagery existed. 
Annotating the video in that case, involved solving for the pose of the video frames with 
respect to the reference image. In many applications such as in film production, reference 
imagery and models are offen not available. The problem here therefore is that given a 
video sequence of N frames, we wish to compute the camera poses (the rotations and 
translations) without the knowledge of the 3D model and with some rough knowledge 
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Fig. 4. A frame from UAV video overlayed on the warped reference imagery with annotations 
being displayed 



of the internal parameters of the camera. The approach should work with a variety of 
different 3D camera motions, especially those in which novel parts of the scene appear 
and disappear relatively rapidly. Of course, imaging scenarios in which features remain 
fairly persistent, as in fixated motions, are also naturally handled. A sparse collection 
of 3D features are also computed in the process of pose computation. In general, the 
problems of correspondence over the sequence, camera pose and 3D structure estimation, 
and camera calibration estimation are tied together. Solving all the problems in a single 
optimization problem is complex and will in general not lead to stable and correct 
solutions. Therefore, we adopt a strategy of progressive complexity with feedback in 
which earlier stages work on smaller subsequences of the data and generate inputs for the 
latter stages which estimate consistent poses over the entire sequence. The advantage of 
dividing the problem into subsequence estimates and then combining these is threefold: 
(i) better computational efficiency since global bundle block adjustment for all frames is 
expensive, (ii) progressively better generation of estimates leading to a global maximum 
likelihood estimation, and (iii) limited build-up of error due to small-scale concatenation
of local pose estimates (akin to [?]).

4.1 Pose Estimation for Unmodeled Scenes 

We divide the pose estimation problem for unmodeled scenes into the following different 
steps: 
1. Feature Tracking: Our method allows for frame-to-frame patch tracking while allowing
for new features to emerge and older features to disappear. The first step is
to choose new features in every new frame. Features are chosen on the basis of their 
contrast strength and distinctiveness with respect to their local neighborhoods. In 
any given frame both new points and points projected from a previous frame are 
checked for flow consistency at their locations. Points that are flow consistent are 
kept for further tracking. The process of instantiation of new points, projection of 
previous points into a new frame and flow consistency checks is repeated over the 
whole sequence to obtain multi-frame point tracks. 

2. Pairwise estimation of camera poses: Initial estimates for the camera poses are
computed using the fundamental matrix constraint. The fundamental matrix can be
computed by a number of known techniques. We employ Zhang’s algorithm [?],
which combines a linear method for initialization with refinement by a method that
employs image-based error minimization. Furthermore, outliers are rejected using
a least-median-of-squares minimization. The final fundamental matrix is computed
using the image-based error measure after outlier rejection. With the knowledge of
the approximately known calibration, the fundamental matrix can be decomposed
into the camera pose matrices using the technique of Hartley [5].

3. Computation of camera poses for sub-sequences: In order to exploit the static rigid 
scene constraint, the pairwise camera estimates are used to create consistent camera 
pose estimates and the corresponding 3D point locations over short subsequences. 
Point tracks that persist for the time period of each subsequence are used to create 
the consistent estimates. In order to compute the maximum likelihood estimates for 
the camera poses and 3D points for the subsequence, a bundle block adjustment is 
applied. 

4. Aligning sub-sequences: Subsequence computation is performed with a few frames of
overlap between consecutive sub-sequences. The points that are visible in two
overlapping sub-sequences are used to solve for the absolute orientation that relates the
coordinate systems of the two sub-sequences. This is used to represent both
sub-sequences in a common coordinate system, so that they may be stitched together.

5. Refinement of poses over sequence: The stitching of subsequences allows the repre- 
sentation of poses and the 3D points in a single coordinate system. However, point 
tracks that are common across more than one sub-sequence provide constraints for 
further global adjustment of the 3D parameters. In the final step, bundle block ad- 
justment is applied to the complete set of frames and 3D points. For computational 
efficiency, this adjustment can be applied in small sets of arbitrary frames or can 
be applied to the complete set. The interesting aspect of the representation here is 
that any combination of internal parameters, pose parameters or 3D points can be 
adjusted while maintaining a global representation. 
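
The overlap between consecutive sub-sequences described in steps 3 and 4 can be illustrated with a minimal Python sketch. The function names `split_into_subsequences` and `shared_frames`, and the window length and overlap used in the example, are hypothetical; the text only says that "a few frames" overlap and does not give these values.

```python
def split_into_subsequences(num_frames, length, overlap):
    """Return (start, end) frame-index pairs, end exclusive, that cover
    range(num_frames) so that consecutive windows share `overlap` frames."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap
    windows = []
    start = 0
    while True:
        end = min(start + length, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += step
    return windows


def shared_frames(a, b):
    """Frame indices present in both windows; the 3D points tracked through
    these frames are the ones available for aligning the two windows."""
    return list(range(max(a[0], b[0]), min(a[1], b[1])))


# Hypothetical example: 20 frames, windows of 8 frames, 3 frames of overlap.
windows = split_into_subsequences(20, 8, 3)
overlap_01 = shared_frames(windows[0], windows[1])
```

Each window would be bundle adjusted on its own (step 3); the frames returned by `shared_frames` identify where two neighbouring reconstructions can be related by an absolute orientation (step 4).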



4.2 3D Insertion Using Computed Poses 

Accurate and stable pose estimation is important for the application of 3D match move, in
which synthetic objects are inserted into real images and their viewpoints are mimicked
according to the real camera’s pose and internal parameters. Now we show examples of
3D match move using the pose estimation technique presented in the earlier part of the
paper.

Fig. 5. Three frames of the Garden sequence with synthetic 3D flamingos inserted. The 3D
placement was done only in one frame, and all others were generated through rendering using the
automatically computed poses.

Fig. 5 shows three frames from a garden sequence with synthetic 3D flamingos
inserted. The 3D placement was performed manually in one frame only, and all others
were generated through rendering using the automatically computed poses, so minimal
interaction is demanded of the user. Note also that, both in the still image displays and
in the videos, no drift or jitter of the objects is noticeable in any of the frames. This is a
qualitative visual validation of the stability of the pose and 3D structure computation of
the algorithm developed.
Discussion

Rick Szeliski: You showed the sequence with the flamingo inserted. How much effort 
would it be today, to insert a flamingo behind the fence — with occlusions, and not the 
situation you planned for? 

Keith Hanna: That’s a good question. Obviously one way to do it would be to recover 
or build a 3D model of the scene. But you need to compare how long it takes to make the 
3D model with someone going in and delineating the occlusion by hand. In Hollywood, 
they do a lot by hand, so if we want to get more into Hollywood, we need to understand 
what they do. Here they’d probably do a pose estimation then delineate by hand, maybe 
with the help of tracking tools. As computer vision people we’d like to build a 3D model 
and then do everything ourselves, but in practice in many applications we’d be wasting 
our time actually trying to do that. 

Yongduek Seo: The graphics object inserted in the video scene is very stable. Do you 
have any special method? 

Keith Hanna: Basically all the approaches use iterative refinement. You recover a first
estimate of your poses and depths, then use those to refine iteratively, to get closer
to registration. The reason for the accuracy is just continual iterative refinement of the
parameters to improve the alignment. Five or six iterations is usually enough.

Yongduek Seo: How large are the images?

Keith Hanna: Just regular sizes, 720 x 480 here I think. All the video stuff is standard
768 x 480. The movie ones are bigger, about 1k.

Luc Robert: First I’d like to add to your answer to Rick’s question. Hollywood people 
are very skilled, but sometimes they simply can’t do it by hand, or it’s too painful like 
painting individual pixels. So, 3D maps are something they use a lot. They often build 
3D models just to predict binary image maps for compositing. That would be a good 
solution for the fence. 

I have a question about the example with the baseball player moving in front of the 
billboard. You clearly handle the occlusion, but the background is green. Can you also 
do it against an arbitrary, non-color-key-able background? 

Keith Hanna: When there is a lot of background texture, detecting the difference is 
much harder, and any slight misalignment shows up very clearly. Even a tenth of a pixel 
is visible if you have a high-contrast edge. There are other constraints you can use, like 
motion or spatial continuity, but typically it’s much easier when the background is flat. 

Another interesting thing you may have seen is video insertion on the ground in 
soccer. There we do a video mixing of about 75% logo and 25% grass background. 
There are a couple of reasons for that. One is that the texture of the grass comes through, 
so it really looks like the logo is painted on the grass. In soccer it looks great actually, 
we were all blown away when we saw that! But the second reason is that it makes the 
occlusion analysis a little easier — errors are less noticeable if you do a little mixing. 




Computer-Vision for the Post-production World: 
Facts and Challenges 
through the REALViZ Experience 



Luc Robert 

REALViZ S.A., BP037, 06901 Sophia Antipolis, France 
Luc.Robert@realviz.com

Luc Robert described the products being developed at REALViZ and their
application to special effects and post production in the film industry, in par-
ticular the MatchMover product, a “3-D Camera Tracker” that computes
a camera for each frame of the film by tracking 2-D features through an image
sequence. We include below the discussion following his presentation.

Harpreet Sawhney: In the case of 2D tracking it’s easy to figure out when 
things start going wrong. But when you go to MatchMoving with camera esti- 
mation and so forth, how do you tell a non-expert user what to do when the 
computation has gone wrong? 

Luc Robert: Well, you write lots of pages of documentation, hoping that the 
user won’t have to read them. . . Yes, this is one of the most difficult parts of 
MatchMoving, making things clear to the user when they’re sometimes not even 
clear to us. You can provide survey tools that inspect image residuals, time 
averages, anything that might be useful. But beyond that it’s not clear. 

Yongduek Seo: I did some similar work, and in that case, although the results 
were very good in parameter space, we still found some trembling and things 
like that in the real video. Does this happen with your system? Also, you are 
using Euclidean parameters, I wonder if you have considered a calibration-free 
approach. 

Luc Robert: As far as the first question is concerned, yes, even with our
software, the thing sometimes wiggles and the trajectories are a little shaky.
So you have to provide the user with tools for smoothing out parameters, or 
trying to fix things by hand if necessary. One case where it often happens is 
when you compute a trajectory with variable zoom. There is a near-ambiguity 
between zooming and moving forward, so the estimated camera trajectory can 
be very jagged in the z-direction. But you can usually fix that by applying a 
smoothing filter to the zoom parameter and recomputing the other parameters. 
If the objects you insert in the scene are exactly between the points you track, 
it’s usually OK even without smoothing. But when you add ‘noise amplifiers’ 
— objects which go much further than the points you’ve tracked — you start 
seeing vibrations and for this filtering is pretty efficient. I don’t like it because 
it means that the algorithm has in some sense failed, but it’s the best solution 
we’ve found up till now. 
Fig. 1. A synthetic car inserted in a real scene using the REALViZ software.



As for the second question, unfortunately we have to fit into the standard 
production pipeline. So we get images here and we produce camera files there. 
The camera files have to be read by animation packages, which do not handle 
projective cameras. So we have to do Euclidean stuff. Otherwise no one would 
be able to use the results. 

Joe Mundy: Given that a person in the loop is essential in most circumstances, 
how do you think the computer vision community should be thinking about 
algorithms to best take advantage of human interaction? 

Luc Robert: I don’t really know. I think that when things are not very clear 
to you, you can’t explain them to someone else, and usually the reverse is true. 
In MatchMoving, the MatchMovers are usually cameramen, because they un- 
derstand best how a synthetic camera is looking at the world. 

Joe Mundy: What this suggests to me is that the user-interface needs to be 
much more tightly coupled to the computer vision algorithms. Rather than just 
pointing, there has to be a deeper interaction with the imagery. It seems to me 
that perhaps we should start to look more carefully at what people have done in 
human factors research. There’s been a tremendous amount of research for fields 
like piloting aircraft, and maybe an invited talk at a computer vision conference 
on that material would be useful. 
1 Introduction

This report provides a brief summary of the review of “Direct Methods” , which 
was presented by Michal Irani and P. Anandan. 

In the present context, we define “Direct Methods” as methods for motion 
and/or shape estimation, which recover the unknown parameters directly from 
measurable image quantities at each pixel in the image. This is in contrast to the
“feature-based methods”, which first extract a sparse set of distinct features 
from each image separately, and then recover and analyze their correspondences 
in order to determine the motion and shape. Feature-based methods minimize an 
error measure that is based on distances between a few corresponding features, 
while direct methods minimize an error measure that is based on direct image 
information collected from all pixels in the image (such as image brightness, or 
brightness-based cross-correlation, etc). 

2 The Brightness Constraint 

The starting point for most direct methods is the “brightness constancy con-
straint”, namely, given two images $J(x,y)$ and $I(x,y)$,

$$J(x,y) = I(x + u(x,y),\; y + v(x,y)),$$

where $(x,y)$ are pixel coordinates, and $(u,v)$ denotes the displacement of pixel
$(x,y)$ between the two images. Assuming small $(u,v)$, and linearizing $I$ around
$(x,y)$, we can obtain the following well-established constraint [?]:

$$I_x u + I_y v + I_t = 0, \tag{1}$$

where $(I_x, I_y)$ are the spatial derivatives of the image brightness, and $I_t = I - J$.
All the quantities in these equations are functions of image position $(x,y)$,
hence every pixel provides one such equation that constrains the displacement
of that pixel. However, since the displacement of each pixel is defined by two
quantities, $u$ and $v$, the brightness constraint alone is insufficient to determine
the displacement of a pixel. A second constraint is provided by a “global motion
model” , namely a model that describes the variation of the image motion across 
the entire image. These models can be broadly divided into two classes: Two- 
dimensional (2D) motion models and three-dimensional (3D) motion models. 
Below we describe how direct methods have been used in connection with these 
two classes of models. A more complete description of a hierarchy of different 
motion models can be found in [?].
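
For readers who want the intermediate step, the linearization behind Equation (1) can be written out as follows. This is only a first-order Taylor expansion of the brightness constancy constraint, under the stated assumption that $(u,v)$ is small enough for higher-order terms to be neglected:

```latex
\begin{align*}
J(x,y) &= I(x+u,\; y+v) \\
       &\approx I(x,y) + I_x\,u + I_y\,v \\
\Longrightarrow\quad 0 &= I_x\,u + I_y\,v + \underbrace{\bigl(I(x,y) - J(x,y)\bigr)}_{I_t} .
\end{align*}
```

This is one linear equation in the two unknowns $(u,v)$ per pixel, which is why a further constraint is needed.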

3 2D Global Motion Models 

The 2D motion models use a single global 2D parametric transformation to 
define the displacement of every pixel contained in their region of support. A 
frequently used model is the affine motion model^1, which is described by the
equations:

$$\begin{aligned}
u(x,y) &= a_1 + a_2 x + a_3 y \\
v(x,y) &= a_4 + a_5 x + a_6 y
\end{aligned} \tag{2}$$

The affine motion model is a very good approximation for the induced image 
motion when the camera is imaging distant scenes, such as in airborne video or 
in remote surveillance applications. Other 2D models which have been used by
direct methods include the quadratic motion model [?], which describes the
motion of a planar surface under small camera rotation, and the 2D projective
transformation (a homography) [?], which describes the exact image motion of
an arbitrary planar surface between two discrete uncalibrated perspective views.

The method of employing the global motion constraint is similar, regardless 
of the selected 2D global motion model. As an example, we briefly describe here 
how this is done for the affine transformation. 

We can substitute the affine motion of Equation (2) into the brightness con-
straint in Equation (1) to obtain

$$I_x (a_1 + a_2 x + a_3 y) + I_y (a_4 + a_5 x + a_6 y) + I_t = 0. \tag{3}$$

Thus each pixel provides one constraint on the six unknown global parameters
$(a_1, \ldots, a_6)$. Since these parameters are global (i.e., the same parameters are
shared by all the pixels), six independent constraints from six different pixels
are theoretically adequate to recover them. In practice,
however, the constraints from all the pixels within the region of analysis (which could
be the entire image) are combined to minimize the error:

$$E(a_1, \ldots, a_6) = \sum \bigl( I_x (a_1 + a_2 x + a_3 y) + I_y (a_4 + a_5 x + a_6 y) + I_t \bigr)^2 \tag{4}$$

Note that different pixels contribute differently to this error measure. For exam-
ple, a pixel along a horizontal edge in the image will have significant $I_y$, but zero
$I_x$, and hence will only constrain the estimation of the parameters $(a_4, a_5, a_6)$
and not the others. Likewise a pixel along a vertical edge will only constrain
the estimation of the parameters $(a_1, a_2, a_3)$. On the other hand, at a corner-like
pixel and within a highly textured region, both the components of the gradient
will be large, and hence the pixel will constrain all the parameters of the global
affine transformation. Finally, a pixel in a homogeneous area will contribute little
to the error since the gradient will be very small.

^1 The affine transformation accurately describes the motion of an arbitrary planar
surface for a fully rectified pair of cameras, i.e., when the optical axes are parallel
and the baseline is strictly sideways.

In other words, the direct methods use information from all the pixels, weight- 
ing the contribution of each pixel according to the underlying image structure 
around that pixel. This eliminates the need for explicitly recovering distinct 
features. In fact, even images that contain no distinct feature points can be
analyzed, as long as there is sufficient image gradient along different directions 
in different parts of the image. 
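
The least-squares estimate of the six affine parameters can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `solve_linear_system` and `estimate_affine` are hypothetical, the gradients are synthetic values generated only to exercise the solver, and a single linear solve is shown (no warping or iteration).

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def estimate_affine(samples):
    """samples: iterable of (x, y, Ix, Iy, It), one tuple per pixel.
    Returns [a1, ..., a6] minimizing the sum of squared constraint residuals."""
    A = [[0.0] * 6 for _ in range(6)]
    b = [0.0] * 6
    for x, y, Ix, Iy, It in samples:
        # Per-pixel constraint written as phi . a + It = 0.
        phi = [Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y]
        for i in range(6):
            b[i] -= phi[i] * It
            for j in range(6):
                A[i][j] += phi[i] * phi[j]
    return solve_linear_system(A, b)


# Synthetic check (assumed values): gradients vary in direction across an
# 8 x 8 grid, and It is generated from known parameters via the constraint.
true_a = [0.5, 0.01, -0.02, -0.3, 0.015, 0.005]
samples = []
for y in range(8):
    for x in range(8):
        Ix = math.cos(0.7 * x + 0.3 * y)
        Iy = math.sin(0.4 * x - 0.9 * y + 0.5)
        u = true_a[0] + true_a[1] * x + true_a[2] * y
        v = true_a[3] + true_a[4] * x + true_a[5] * y
        samples.append((x, y, Ix, Iy, -(Ix * u + Iy * v)))
estimated_a = estimate_affine(samples)
```

Because each pixel enters the normal equations through its own gradient, pixels with small gradients add almost nothing to `A` and `b`, which is the implicit weighting described above. If all gradients pointed in one direction, `A` would be singular and the solver would raise an error.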



4 Coarse-to-Fine Iterative Estimation 

The basic process described above relies on linearizing the image brightness 
function (Equation (1)). This linearization is a good approximation when (u,v)
are small (e.g., less than one pixel). However, this is rarely satisfied in real 
video sequences. The scope of the direct methods has therefore been extended 
to handle a significantly larger range of motions via coarse-to-fine processing, 
using iterative refinement within a multi-resolution pyramid. 

The basic observation behind coarse-to-fine estimation is that given proper 
filtering and subsampling, the induced image motion decreases as we go from 
full resolution images (fine pyramid levels) to small resolution images (coarse 
pyramid levels). The analysis starts at the coarsest resolution level, where the 
image motion is very small. The estimated global motion parameters are used to 
warp one image toward the other, bringing the two images closer to each other. 
The estimation process is then repeated between the warped images. Several it-
erations (typically 4 or 5) of warping and refinement are used to further increase
the search range. After a few iterations, the parameters are propagated to the 
next (finer) pyramid level, and the process is repeated there. This iterative-refine 
estimation process is repeated and propagated all the way up to the finest reso- 
lution level, to yield the final motion parameters. A more complete description 
of the coarse-to-fine approach can be found in [?].

With the use of coarse-to-fine refinement, direct methods have been extended
to handle image motions of typically up to 10-15 percent of the image size. This
range is more than adequate for handling the type of motions found in real video
sequences. Direct methods are also used for aligning images taken by different
cameras, whose degree of misalignment does not exceed the above-mentioned
range. For larger misalignments, an initial estimate is required.
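
The coarse-to-fine loop can be illustrated on a deliberately simplified problem: estimating a single global translation between two 1D signals. The Python sketch below is an assumption-laden toy, not the implementation used in the cited work. The function names, the [1, 2, 1]/4 smoothing kernel, the linear-interpolation warp, and the synthetic signal are all choices made here for illustration.

```python
import math


def reduce(signal):
    """Smooth with a [1, 2, 1]/4 kernel and keep every second sample."""
    n = len(signal)
    out = []
    for i in range(0, n, 2):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out


def warp(signal, u):
    """Sample the signal at x + u with linear interpolation (clamped at the ends)."""
    n = len(signal)
    out = []
    for x in range(n):
        pos = min(max(x + u, 0.0), n - 1.0)
        i0 = min(int(math.floor(pos)), n - 2)
        frac = pos - i0
        out.append((1.0 - frac) * signal[i0] + frac * signal[i0 + 1])
    return out


def refine(I, J, u, iterations):
    """Iteratively update u so that J(x) is approximately I(x + u)."""
    for _ in range(iterations):
        Iw = warp(I, u)
        num = den = 0.0
        for x in range(1, len(I) - 1):
            Ix = 0.5 * (Iw[x + 1] - Iw[x - 1])  # spatial derivative
            It = Iw[x] - J[x]                   # residual brightness difference
            num += Ix * It
            den += Ix * Ix
        if den == 0.0:
            break
        u -= num / den  # least-squares solution of Ix * du + It = 0
    return u


def coarse_to_fine_shift(I, J, levels=4, iterations=5):
    pyramid = [(I, J)]
    for _ in range(levels - 1):
        I, J = reduce(I), reduce(J)
        pyramid.append((I, J))
    u = 0.0
    for level, (Il, Jl) in enumerate(reversed(pyramid)):
        if level > 0:
            u *= 2.0  # propagate the estimate to the next finer level
        u = refine(Il, Jl, u, iterations)
    return u


# Synthetic signals (assumed): two smooth bumps, displaced by 10.3 samples.
def bump(x):
    return (math.exp(-((x - 100.0) / 30.0) ** 2)
            + 0.5 * math.exp(-((x - 170.0) / 25.0) ** 2))


I = [bump(x) for x in range(256)]
J = [bump(x + 10.3) for x in range(256)]
estimated = coarse_to_fine_shift(I, J)
```

At the coarsest of the four levels the 10.3-sample displacement shrinks to about 1.3 samples, which is within reach of the linearized update; each finer level then only has to correct a small residual.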
5 Properties of Direct Methods

In addition to the use of constraints from all the pixels, weighted according to the 
information available at each pixel, direct methods have a number of properties 
that have made them attractive in practice. Here we note three of these: (i) high
sub-pixel accuracy, (ii) the “locking property”, and (iii) dense recovery of
shape in the case of 3D estimation. Properties (i) and (ii) are briefly explained
in this section, while property (iii) is referred to in Section 6.



5.1 Sub-pixel Accuracy 



Since direct methods use “confidence-weighted” local constraints from every 
pixel in the image to estimate a few global motion parameters (typically 6 or 8), 
these parameters are usually estimated to very high precision. As a result, the 
displacement vector induced at each pixel by the global motion model is precise
up to a fraction of a pixel (misalignment error is usually less than 0.1 pixel).
This has led to their use in a number of practical situations including mosaicing
[?], video enhancement [?], and super-resolution [?], all of which
require sub-pixel alignment of images. Figure 1 shows an example of a mosaic
constructed by aligning a long sequence of video frames using a direct method
with a frame-to-frame affine motion model. Note that the alignment is seamless.
Figure 2 shows an example of video enhancement. Note the improvement in the
fine details in the image, such as the windows of the building. For examples of
super-resolution using direct image alignment see [?].



5.2 Locking Property and Outlier Rejection 

Direct methods can successfully estimate global motion even in the presence of
multiple motions and/or outliers. Burt et al. [?] used a frequency-domain anal-
ysis to show that the coarse-to-fine refinement process allows direct methods to
“lock-on” to a single dominant motion even when multiple motions are present.
While their analysis focuses on the case of global translation, in practice di-
rect methods have been successful in handling outliers even for affine and other
parametric motions. Irani et al. [?] achieved further robustness by using an
iterative reweighting approach with an outlier measure that is easy to compute
from image measurements. Black and Anandan [?] used M-estimators to recover
the dominant global motion in the presence of outliers. Figure 3 (from [?])
shows an example of dominant motion selection, in which the second motion (a
person walking across the room) occupies a significant area of the image. Other
examples of dominant motion selection can be found in a number of papers in
the literature (e.g., see [?]).
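
The effect of iterative reweighting can be shown on a toy problem with one unknown translation and per-pixel constraints of the form $I_x u + I_t = 0$. The Python sketch below is a generic iteratively reweighted least-squares loop with a Geman-McClure-type weight; the weight function, the scale `sigma`, the function names, and the synthetic data are assumptions made here and are not the specific outlier measures of the papers cited above.

```python
import math


def least_squares_shift(constraints, weights=None):
    """Minimize sum of w * (Ix * u + It)^2 over the single unknown u."""
    if weights is None:
        weights = [1.0] * len(constraints)
    num = sum(w * ix * it for w, (ix, it) in zip(weights, constraints))
    den = sum(w * ix * ix for w, (ix, it) in zip(weights, constraints))
    return -num / den


def robust_shift(constraints, sigma=1.0, iterations=10):
    """Iteratively reweighted least squares: pixels with large residuals
    under the current estimate receive small weights in the next solve."""
    u = least_squares_shift(constraints)
    for _ in range(iterations):
        weights = []
        for ix, it in constraints:
            r = ix * u + it
            weights.append(1.0 / (1.0 + (r / sigma) ** 2) ** 2)
        u = least_squares_shift(constraints, weights)
    return u


# Synthetic data (assumed): 80 pixels follow the dominant motion u = 1.0,
# 20 pixels follow a second motion u = -3.0.
constraints = []
for i in range(100):
    ix = 1.0 + 0.5 * math.sin(i)
    true_u = 1.0 if i < 80 else -3.0
    constraints.append((ix, -ix * true_u))

u_plain = least_squares_shift(constraints)
u_robust = robust_shift(constraints)
```

The unweighted estimate is pulled towards a compromise between the two motions, whereas the reweighted estimate settles close to the dominant one because the second motion's pixels are progressively down-weighted.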



6 3D Motion Models 

So far, we have focused on using direct methods for estimating global 2D para-
metric motions. In these cases, a small number of parameters (typically 6 or 8)
Fig. 1. Panoramic mosaic of an airport video clip. (a) A few represen-
tative frames from a one-minute-long video clip. The video shows an airport
being imaged from the air with a moving camera. (b) The mosaic image built
from all frames of the input video clip. Note that the alignment is seamless.




Fig. 2. Video enhancement. (a) One out of 20 noisy frames (all frames
are of similar quality). (b) The corresponding enhanced frame in the enhanced
video sequence (all the frames in the enhanced video are of the same quality).









Fig. 3. Dominant motion selection and outlier rejection. (a) 3
representative frames from the sequence. There are two motions present: that
induced by the panning camera, and that induced by the walking woman.
(b) Outlier pixels detected in those frames are marked in black. Those are
pixels found to be moving inconsistently with the detected dominant motion.
Those pixels correspond to the walking woman, to her reflection in the desk,
to the boundaries of the image frames, and to some noisy pixels. (c) Full
reconstructions of the dominant layer (the background) in all frames. The woman,
her reflection, and the noise are removed from the video sequence by filling in
the black regions with gray-level information from other frames according to
the computed dominant background motion.

can describe the motion of every pixel in the region consistent with the global 
motion. However, these 2D motion models cannot model frame-to-frame motion 
when significant camera translation and non-planar depth variations are present. 
These scenarios require 3D motion models. The 3D motion models consist of two 
sets of parameters: a set of global parameters, which represent the effects of cam-
era motion, and a set of local parameters (one per pixel), which represent the
3D structure or the “shape”^2. Examples of 3D motion models include:

(i) The instantaneous velocity field model:

$$\begin{aligned}
u &= -xy\,\Omega_X + (1 + x^2)\,\Omega_Y - y\,\Omega_Z + (T_X - T_Z x)/Z \\
v &= -(1 + y^2)\,\Omega_X + xy\,\Omega_Y + x\,\Omega_Z + (T_Y - T_Z y)/Z,
\end{aligned}$$

where $(\Omega_X, \Omega_Y, \Omega_Z)$ and $(T_X, T_Y, T_Z)$ denote the camera rotation and transla-
tion parameters, and the depth value $Z$ represents the local shape.



^2 These types of 3D models are referred to as “quasi-parametric” models in [?].
(ii) The discrete 3D motion model, parameterized in terms of a homography and
the epipole:

$$\begin{aligned}
u &= \frac{h_1 x + h_2 y + h_3 + \gamma t_1}{h_7 x + h_8 y + h_9 + \gamma t_3} - x \\
v &= \frac{h_4 x + h_5 y + h_6 + \gamma t_2}{h_7 x + h_8 y + h_9 + \gamma t_3} - y,
\end{aligned}$$

where $(h_1, \ldots, h_9)$ denote the parameters of the homography, $(t_1, t_2, t_3)$ repre-
sents the epipole in homogeneous coordinates, and $\gamma$ represents the local shape.

(iii) The plane+parallax model:

$$\begin{aligned}
u &= x^w - x = \frac{\gamma}{1 + \gamma t_3}\,(t_3 x - t_1) \\
v &= y^w - y = \frac{\gamma}{1 + \gamma t_3}\,(t_3 y - t_2),
\end{aligned}$$

where $(x^w, y^w)$ correspond to the image locations obtained after warping the image
according to the induced homography (2D projective transformation) of a dom-
inant planar surface (see [?] for more details). Direct methods have been
applied in conjunction with 3D motion models to simultaneously recover the
global camera motion parameters and the local shape parameters from image
measurements. For example, [?] have used the instantaneous velocity equations
to recover the camera motion and shape from two and multiple images. Szeliski
and Kang [?] directly recovered the homography, the epipole, and the local
shape from image intensity variations, and Kumar et al. [?] and Irani et al.
[?] have applied direct methods using the plane+parallax model with two and
multiple frames, respectively.
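
To make the roles of the global and local parameters concrete, the following Python sketch evaluates models (i) and (iii) at a single pixel. It is an illustration only: the function names are hypothetical, and image coordinates are assumed to be normalized (unit focal length), which the formulas as printed suggest but the text does not state.

```python
def instantaneous_flow(x, y, omega, T, Z):
    """Image velocity (u, v) at point (x, y) for the instantaneous velocity
    field model. omega and T are the global rotation and translation;
    Z is the local depth at this pixel."""
    wx, wy, wz = omega
    tx, ty, tz = T
    u = -x * y * wx + (1.0 + x * x) * wy - y * wz + (tx - tz * x) / Z
    v = -(1.0 + y * y) * wx + x * y * wy + x * wz + (ty - tz * y) / Z
    return u, v


def plane_parallax_flow(x, y, gamma, t):
    """Residual parallax displacement (u, v) for the plane+parallax model.
    t = (t1, t2, t3) is the epipole; gamma is the local shape parameter."""
    t1, t2, t3 = t
    k = gamma / (1.0 + gamma * t3)
    return k * (t3 * x - t1), k * (t3 * y - t2)


# Pure sideways translation: the flow is horizontal and scales with 1/Z.
near = instantaneous_flow(0.2, -0.1, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)
far = instantaneous_flow(0.2, -0.1, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 4.0)

# A point on the reference plane (gamma = 0) has no residual parallax.
on_plane = plane_parallax_flow(0.3, 0.4, 0.0, (0.1, 0.2, 1.0))
off_plane = plane_parallax_flow(0.3, 0.4, 0.5, (0.1, 0.2, 1.0))
```

In both models the global parameters are shared by all pixels, while `Z` or `gamma` varies per pixel; in the plane+parallax case the direction of the displacement is fixed by the epipole and only its magnitude depends on the local shape.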

All of these examples of using direct methods with 3D motion models use 
multi-resolution coarse-to-fine estimation to handle large search ranges. The 
computational methods are roughly similar to each other and are based on the 
approach described in [?] for quasi-parametric models.

Figure 4 shows an example of applying the plane+parallax model to the
“block sequence” [?]. These results were obtained using the multiframe tech-
nique described in [?]. A natural outcome of using the direct approach with
a 3D motion model is the recovery of a dense shape map of the scene, as is
illustrated in Figure 4. Dense recovery is made possible because at every pixel
the Brightness Constancy Equation (1) provides one line constraint, while the
epipolar constraint provides another line constraint. The intersection of these
two line constraints uniquely defines the displacement of the pixel. Other ex- 
amples of using direct methods for dense 3D shape and motion recovery can be 
found in the various papers cited above. 
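
As an illustration of this intersection, consider the plane+parallax model with a known epipole $(t_1, t_2, t_3)$. This is a sketch under the assumption that the brightness constraint is applied to the residual displacement $(u,v)$ remaining after the plane warp; the symbol $\kappa = \gamma / (1 + \gamma t_3)$ is introduced here for convenience. The epipolar constraint restricts $(u,v)$ to a line through the pixel, and substituting it into the brightness constraint leaves one unknown per pixel:

```latex
\begin{align*}
(u, v) &= \kappa\,\bigl(t_3 x - t_1,\; t_3 y - t_2\bigr), \\
I_x u + I_y v + I_t = 0
  \quad&\Longrightarrow\quad
  \kappa = -\,\frac{I_t}{I_x\,(t_3 x - t_1) + I_y\,(t_3 y - t_2)}, \\
\gamma &= \frac{\kappa}{1 - \kappa\, t_3}.
\end{align*}
```

The denominator vanishes when the image gradient is perpendicular to the epipolar direction at that pixel. In that case the two lines are parallel and the pixel carries no information about its own shape parameter.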

7 Handling Changes in Brightness 

Since the brightness constancy constraint is central to the direct methods, a 
natural question arises concerning the applicability of these techniques when the 
Fig. 4. Shape recovery using the Plane+Parallax model. (a) One
frame from the sequence. (b) The recovered shape (relative to the carpet
plane). Brighter values correspond to taller points.



brightness of a pixel is not constant over multiple images. There are two ways 
of handling such changes. The first approach is to renormalize the image inten- 
sities to reduce the effects of such changes in brightness over time. For example, 
normalizing the images to remove global changes in mean and contrast often 
handles effects of overall lighting changes. More local variations can be handled 
by using Laplacian pyramid representations and by applying local contrast nor-
malizations to the Laplacian filtered images (see [?] for a real-time direct affine
estimation algorithm which uses Laplacian pyramid images together with some 
local contrast normalization). 
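
The simplest form of renormalization, removing a global change in mean and contrast, can be sketched as follows. This is a minimal Python illustration with a hypothetical function name and a made-up 2 x 3 image; it covers only the global case, not the Laplacian-pyramid or local contrast normalization mentioned above.

```python
import math


def normalize(image):
    """Subtract the global mean and divide by the global standard deviation.
    `image` is a list of rows of brightness values."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(var)
    if std == 0.0:
        return [[0.0 for _ in row] for row in image]
    return [[(p - mean) / std for p in row] for row in image]


# The second image is the first under an assumed global gain and offset.
image_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 9.0]]
image_b = [[1.8 * p + 12.0 for p in row] for row in image_a]
norm_a = normalize(image_a)
norm_b = normalize(image_b)
```

After normalization the two images agree up to rounding, so the brightness constancy constraint holds again even though the raw intensities differ.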

A second (and more recent) approach to handling brightness variation is to 
generalize the entire approach to use other local match measures besides the 
brightness error. This approach is discussed in more detail in Section 8.

8 Other Local Match Measures 

Irani and Anandan [?] describe a general approach for extending direct methods
to handle any user-defined local match measure. In particular, instead of ap-
plying the linearization and the iterative refinement to brightness surfaces, the
regression in [?] is applied directly to normalized-correlation surfaces, which are
measured at every pixel in the image. A global affine transformation is sought, 
which simultaneously maximizes as many local correlation values as possible. 
This is done without prior commitment to particular local matches. The choice 
of local displacements is constrained on one hand by the global motion model 
(could be a 2D affine transformation or a 3D epipolar constraint), and on the 
other hand by the local correlation variations. 
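
A local normalized-correlation surface of the kind referred to above can be sketched in 1D. The Python code below is an illustration under stated assumptions: the names `ncc` and `correlation_surface`, the patch size, the shift range, and the synthetic signals are chosen here. It shows only how one surface is measured at one pixel, not the global regression over all such surfaces.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equally long patches."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den_a = math.sqrt(sum((p - mean_a) ** 2 for p in a))
    den_b = math.sqrt(sum((q - mean_b) ** 2 for q in b))
    if den_a == 0.0 or den_b == 0.0:
        return 0.0
    return num / (den_a * den_b)


def correlation_surface(I, J, x, half_width, max_shift):
    """Correlation between the patch of J centred at x and the patches of I
    centred at x + d, for every candidate displacement d."""
    patch_j = J[x - half_width: x + half_width + 1]
    surface = {}
    for d in range(-max_shift, max_shift + 1):
        c = x + d
        surface[d] = ncc(I[c - half_width: c + half_width + 1], patch_j)
    return surface


# Synthetic signals (assumed): J is I displaced by 3 samples, with a
# brightness gain of 2.5 and an offset of 10 applied on top.
I = [math.sin(0.5 * x) + 0.3 * math.sin(1.3 * x + 1.0) for x in range(60)]
J = [2.5 * I[x + 3] + 10.0 for x in range(57)]
surface = correlation_surface(I, J, x=25, half_width=6, max_shift=5)
best_shift = max(surface, key=surface.get)
```

Because the correlation is normalized, the gain and offset do not move the peak of the surface. A direct correlation-based method would keep the whole surface at each pixel rather than committing to `best_shift`, and let the global motion model select displacements that raise as many local correlation values as possible.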

Irani and Anandan show that with some image pre-filtering, the direct corre-
lation based approach can be applied to even extreme cases of image matching,
such as multi-sensor image alignment. Figure 5 shows the results of applying
their approach to recovering a global 2D affine transformation needed to align
an infra-red (IR) image with an electro-optic (video) image. More recently, Man-
delbaum et al. [?] have extended this approach to simultaneously recover the
3D global camera motion and the dense local shape.

9 Summary 

In this paper we have briefly described the class of methods for motion estimation 
called direct methods. Direct methods use measurable image information, such 
as brightness variations or image cross-correlation measures, which is integrated 
from all the pixels to recover 2D or 3D information. This is in contrast to feature- 
based methods that rely on the correspondence of a sparse set of highly reliable 
image features. 

Direct methods have been used to recover 2D global parametric motion mod- 
els (e.g., affine transforms, quadratic transforms, or homographies) , as well as 
3D motion models. In the 3D case, the direct methods recover the dense 3D 
structure of the scene simultaneously with the camera motion parameters (or 
epipolar geometry). Direct methods have been shown to recover pixel motion
up to high subpixel precision. They have also been applied to real-image se-
quences containing multiple motions and outliers, especially in the case of 2D
motion models. The recent use of cross-correlation measures within direct meth-
ods has extended their applicability to image sequences containing significant
brightness variations over time, as well as to alignment of images obtained by
sensors of different sensing modalities (such as IR and video). Direct methods
are capable of recovering misalignments of up to 10-15% of the image size. For
larger misalignments, an initial estimate is required. 
Fig. 5. Multi-sensor Alignment. (a) EO (video) image. (b) IR (Infra-
Red) image. (c) Composite (spliced) display before alignment. (d) Composite
(spliced) display after alignment. Note in particular the perfect alignment of
the water-tank at the bottom left of the images, the building with the arched
doorway at the right, and the roads at the top left of the images.
1 Introduction

This report is a brief overview of the use of “feature based” methods in structure
and motion computation. A companion paper by Irani and Anandan [?] reviews
“direct” methods.

Direct methods solve two problems simultaneously: the motion of the camera 
and the correspondence of every pixel. They effect a global minimization using 
all the pixels in the image, the starting point of which is (generally) the image 
brightness constraint (as explained in the companion paper). 

By contrast we advocate a feature based approach. This involves a strategy 
of concentrating computation on areas of the image where it is possible to get 
good correspondence, and from these an initial estimate of camera geometry 
is made. This geometry is then used to guide correspondence in regions of the 
image where there is less information. Our thesis is as follows: 

Structure and motion recovery should proceed by first extracting features, and 
then using these features to compute the image matching relations. It should not 
proceed by simultaneously estimating motion and dense pixel correspondences. 

The “image matching relations” referred to here arise from the camera motion 
alone, not from the scene structure. These relations are the part of the motion 
that can be computed directly from image correspondences. For example, if the 
camera translates between two views then the image matching relation is the 
epipolar geometry of the view-pair. 
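
As a small illustration of the translating-camera case, the sketch below builds the fundamental matrix for a pure translation and checks the epipolar constraint on synthetic points. The calibration `K`, the translation `t` and the helper `skew` are hypothetical values and names chosen for the example; they do not come from the paper.

```python
import numpy as np

def skew(v):
    """Return [v]_x, the matrix with skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Hypothetical calibration and translation, chosen only for illustration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.2, -0.1, 0.05])

# For cameras K[I | 0] and K[I | t] the epipole in the second view is K t,
# and the fundamental matrix reduces to F = [e']_x.
e2 = K @ t
F = skew(e2)

# Synthetic scene points at varying depth, projected into both views.
rng = np.random.default_rng(0)
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 10.0], size=(20, 3))
x1 = (K @ X.T).T
x2 = (K @ (X + t).T).T
x1 = x1 / x1[:, 2:3]
x2 = x2 / x2[:, 2:3]

# The epipolar constraint x2^T F x1 = 0 holds for every point, whatever its depth.
residuals = np.einsum('ij,jk,ik->i', x2, F, x1)
print(np.abs(residuals).max())
```

The residuals vanish up to rounding error even though the depths differ, which is the sense in which the relation depends on the camera motion and not on the scene structure.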

The rest of this paper demonstrates that there are cogent theoretical and
practical reasons for advocating this thesis when attempting to recover structure
and motion from images. We illustrate the use of feature based methods on two
examples. First, in section 2 we describe in detail a feature based algorithm for
registering multiple frames to compute a mosaic. The frames are obtained by a
camera rotating about its centre, and the algorithm estimates the point-to-point
homography map relating the views. Second, section 3 discusses the more general
case of structure and motion computation from images obtained by a camera
rotating and translating. Here it is shown how feature matching methods form
the basis for a dense 3D reconstruction of the scene (where depth is obtained for
every pixel). Section 4 explores the strengths and weaknesses of feature based
and direct methods, and summarises the reasons why feature based methods
perform so well.

2 Mosaic Computation 

Within this section the feature based approach to mosaicing is described. Given a 
sequence of images acquired by a camera rotating about its centre, the objective 
is to fuse together the set of images to produce a single panoramic mosaic im- 
age of the scene. For this particular camera motion, corresponding image points 
(i.e. projections of the same scene point) are related by a point-to-point planar 
homography map which depends only on the camera rotation and internal calibration,
and does not depend on the scene structure (depth of the scene points).
This map also applies if the camera “pans and zooms” (changes focal length 
whilst rotating). 
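
The depth independence can be checked numerically. The following is a minimal sketch, assuming a pinhole camera with square pixels and using the standard form H = K2 R K1^{-1} for a camera that rotates by R and changes its calibration from K1 to K2. The focal lengths, rotation angle and helper names are illustrative assumptions.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the camera's vertical axis (a pan) by deg degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def calibration(f, cx=320.0, cy=240.0):
    """Simple calibration matrix with focal length f (hypothetical values)."""
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])

def project(K, R, X):
    """Project 3D points X (n x 3) with a camera centred at the origin."""
    x = (K @ R @ X.T).T
    return x[:, :2] / x[:, 2:3]

def apply_homography(H, pts):
    """Map inhomogeneous 2D points (n x 2) through H."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    q = (H @ ph.T).T
    return q[:, :2] / q[:, 2:3]

K1, K2 = calibration(800.0), calibration(1000.0)   # "pan and zoom"
R = rot_y(10.0)
H = K2 @ R @ np.linalg.inv(K1)

# Scene points spread over a wide range of depths (2 to 50 units).
rng = np.random.default_rng(1)
X = rng.uniform([-1.0, -1.0, 2.0], [1.0, 1.0, 50.0], size=(30, 3))
x1 = project(K1, np.eye(3), X)
x2 = project(K2, R, X)
mapped = apply_homography(H, x1)
print(np.abs(mapped - x2).max())
```

A single H maps every point correctly regardless of its depth. Multiplying H by any non-zero scalar leaves the mapping unchanged, which is why only eight of its nine entries are independent.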

A planar homography (also known as a plane projective transformation, or 
collineation) is specified by eight independent parameters. The homography is 
represented as a 3 x 3 matrix that transforms homogeneous image coordinates 
as: 

x' = Hx.

We first describe the computation of a homography between an image pair, and 
then show how this computation is extended to a set of three or more images. 

2.1 Image Pairs 

The automatic feature based algorithm for computing a homography between
two images is summarized in table 1, with an example given in figure 1.

The point features used (developed by Harris [12]) are known as interest
points or “corners”. However, as can be seen from figure 1(c) & (d), the term
corner is misleading as these point features do not just occur at classical corners
(intersection of lines). Thus we prefer the term interest point. Typically there
can be hundreds or thousands of interest points detected in an image.

It is worth noting two things about the algorithm. First, interest points are 
not matched purely using geometry - i.e. only using a point’s position. Instead, 
the intensity neighbourhood of the interest point is also used to rank possi- 
ble matches by computing a normalized cross correlation between the point’s 
neighbourhood and the neighbourhood of a possible match. Second, robust es- 
timation methods are an essential part of the algorithm: more than 40% of the 
putative matches between the interest points (obtained by the best cross cor- 
relation score and proximity) are incorrect. It is the RANSAC algorithm that 
identifies the correct correspondences. 
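
The ranking of putative matches by proximity and normalized cross correlation can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the window radius `r`, the search radius `max_dist`, the function names and the synthetic images are all assumptions.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized patches."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def patch(img, pt, r):
    """Square neighbourhood of radius r around pt = (x, y)."""
    x, y = pt
    return img[y - r:y + r + 1, x - r:x + r + 1]

def rank_matches(img1, img2, pt, candidates, r=3, max_dist=50):
    """Score candidates within max_dist of pt by NCC, best first."""
    ref = patch(img1, pt, r)
    scored = []
    for c in candidates:
        if np.hypot(c[0] - pt[0], c[1] - pt[1]) <= max_dist:
            scored.append((ncc(ref, patch(img2, c, r)), c))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored

# Synthetic pair: image 2 is image 1 shifted by (5, 3) pixels with a
# brightness gain and offset applied.
rng = np.random.default_rng(0)
img1 = rng.random((100, 100))
img2 = 1.5 * np.roll(img1, shift=(3, 5), axis=(0, 1)) + 10.0

candidates = [(25, 23), (22, 20), (30, 30), (90, 90)]
ranked = rank_matches(img1, img2, (20, 20), candidates)
print(ranked[0])
```

In this toy example the true match (25, 23) is ranked first and the distant candidate is discarded by the proximity test. On real images the top-ranked match is frequently wrong, which is why the robust estimation step is needed.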

Given the inlying interest point correspondences $\{x_i \leftrightarrow x'_i\}$, $i = 1, \ldots, n$, the
final estimate of the homography is obtained by minimizing the following cost
function:

$$\sum_i d(x_i, \hat{x}_i)^2 + d(x'_i, \hat{x}'_i)^2 \qquad (1)$$

Fig. 1. Automatic computation of a homography between two images using
the algorithm of table 1. (a) (b) Images of Keble College, Oxford. The motion
between views is a rotation about the camera centre so the images are exactly related
by a homography. The images are 640 x 480 pixels. (c) (d) Detected point features
superimposed on the images. There are approximately 500 features on each image. The
following results are superimposed on the left image: (e) 268 putative matches shown
by the line linking matched points, note the clear mismatches; (f) RANSAC outliers:
117 of the putative matches; (g) RANSAC inliers: 151 correspondences consistent
with the estimated H; (h) final set of 262 correspondences after guided matching and
MLE. The estimated H is accurate to subpixel resolution.

Table 1. The main steps in the algorithm to automatically estimate a homography
between two images using RANSAC and features. Further details are given in [?].

Objective: Compute the 2D homography between two images.

Algorithm:

1. Features: Compute interest point features in each image to subpixel accuracy
   (e.g. Harris corners [12]).

2. Putative correspondences: Compute a set of interest point matches based
   on proximity and similarity of their intensity neighbourhood.

3. RANSAC robust estimation: Repeat for N samples
   (a) Select a random sample of 4 correspondences and compute the homography H.
   (b) Calculate a geometric image distance error for each putative correspondence.
   (c) Compute the number of inliers consistent with H by the number of
       correspondences for which the distance error is less than a threshold.
   Choose the H with the largest number of inliers.

4. Optimal estimation: re-estimate H from all correspondences classified as
   inliers, by minimizing the maximum likelihood cost function (1) using a suitable
   numerical minimizer (e.g. the Levenberg-Marquardt algorithm [?]).

5. Guided matching: Further interest point correspondences are now determined
   using the estimated H to define a search region about the transferred point
   position.

The last two steps can be iterated until the number of correspondences is stable.



In the cost function (1), d(x, y) is the geometric distance between the image points
x and y. The cost is minimized over the homography H and corrected points
$\{\hat{x}_i\}$ such that $\hat{x}'_i = H\hat{x}_i$. This gives the maximum likelihood
estimate of the homography under the assumption of Gaussian measurement noise
in the position of the image points. A fuller discussion of the estimation algorithm
is given in [?], with variations and improvements (the use of MLESAC rather than
RANSAC) given in [?].
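
Steps 3 and 4 of table 1 can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: it uses a normalized direct linear transform (DLT) for the 4-point solve, a one-sided transfer error as the geometric distance, and a linear refit on the inliers in place of the maximum likelihood minimization of (1). The function names, the threshold and the synthetic data are all hypothetical.

```python
import numpy as np

def to_h(pts):
    """Append a homogeneous coordinate of 1 to n x 2 points."""
    return np.hstack([pts, np.ones((len(pts), 1))])

def normalise(pts):
    """Similarity that centres pts and scales their mean distance to sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
    return np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])

def dlt_homography(x1, x2):
    """Linear estimate of H with x2 ~ H x1 from n >= 4 correspondences."""
    T1, T2 = normalise(x1), normalise(x2)
    p = (to_h(x1) @ T1.T)[:, :2]
    q = (to_h(x2) @ T2.T)[:, :2]
    rows = []
    for (x, y), (u, v) in zip(p, q):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(rows))
    Hn = Vt[-1].reshape(3, 3)          # null vector of the design matrix
    return np.linalg.inv(T2) @ Hn @ T1

def transfer_error(H, x1, x2):
    """Distance in image 2 between H x1 and the measured x2."""
    p = to_h(x1) @ H.T
    with np.errstate(divide='ignore', invalid='ignore'):
        proj = p[:, :2] / p[:, 2:3]
    return np.linalg.norm(proj - x2, axis=1)

def ransac_homography(x1, x2, n_samples=500, thresh=2.0, seed=0):
    """Keep the 4-point hypothesis with most inliers, then refit on them."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(x1), dtype=bool)
    for _ in range(n_samples):
        idx = rng.choice(len(x1), size=4, replace=False)
        H = dlt_homography(x1[idx], x2[idx])
        inliers = transfer_error(H, x1, x2) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return dlt_homography(x1[best], x2[best]), best

# Synthetic putative matches: 60 noisy inliers and 40 gross mismatches.
rng = np.random.default_rng(0)
H_true = np.array([[1.02, 0.01, 15.0], [-0.02, 0.98, -8.0], [1e-5, 2e-5, 1.0]])
x1 = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(100, 2))
ph = to_h(x1) @ H_true.T
x2 = ph[:, :2] / ph[:, 2:3]
x2[:60] += rng.normal(0.0, 0.2, size=(60, 2))
x2[60:] = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(40, 2))

H_est, inliers = ransac_homography(x1, x2)
print(inliers.sum())
```

A fuller implementation would follow the refit with a non-linear minimization of (1) and with guided matching, iterating the two as the table describes.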

2.2 From Image Pairs to a Mosaic

The two frame homography estimation algorithm can readily be extended to 
constructing a mosaic for a sequence as follows: 

1. Compute interest point features in each frame. 

2. Compute homographies and correspondences between frames using these 
point features. 

3. Compute a maximum likelihood estimate of the homographies and points 
over all frames. 

4. Use the estimated homographies to map all frames onto one of the input
frames to form the mosaic.

In computing the maximum likelihood estimate the homographies are parametrized
to be consistent across frames. So, for example, the homography between the
first and third frame is obtained exactly from the composition of homographies
between the first and second, and second and third as H13 = H23H12. This is
achieved by computing all homographies with respect to a single set of corrected
points $\hat{x}_i$. Details are given in [?].
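
One way to read this parametrization is that each frame carries a single homography from a common reference, and every pairwise homography is derived from these. The sketch below, which uses randomly generated near-identity matrices as stand-ins for estimated homographies (an assumption made purely for illustration), shows that the composition rule then holds by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_homography():
    """A hypothetical near-identity homography, for illustration only."""
    return np.eye(3) + 0.05 * rng.standard_normal((3, 3))

# One homography per frame, each mapping the common corrected points
# (expressed in a reference frame) into that frame.
H_frame = {i: random_homography() for i in (1, 2, 3)}

def pairwise(i, j):
    """Homography taking points in frame i to points in frame j."""
    return H_frame[j] @ np.linalg.inv(H_frame[i])

def same_up_to_scale(A, B):
    """Homographies are equal if their matrices agree up to a scalar."""
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    return np.allclose(A, B) or np.allclose(A, -B)

H12, H23, H13 = pairwise(1, 2), pairwise(2, 3), pairwise(1, 3)
print(same_up_to_scale(H13, H23 @ H12))
```

Homographies estimated independently for each pair would satisfy this relation only approximately, so errors would accumulate along the sequence.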

The application of this algorithm to a 100 frame image sequence is illustrated
in figures 2 and 3. The result is a seamless mosaic obtained to subpixel accuracy.

Fig. 2. Automatic panoramic mosaic construction. (a) Every 10th frame of a 100
frame sequence acquired by a hand held camcorder approximately rotating about its
lens centre. Note that each frame has a quite limited field of view, and there is no
common overlap between all frames. (b) The computed mosaic, which is seamless, with
frames aligned to subpixel accuracy. The computation method is described in [3].

Fig. 3. Details of the mosaic construction of figure 2. (a) 1000 of the 2500
points used in the maximum likelihood estimation; note the density of points across
the mosaic. (b) Every 5th frame (indicated by its outline); note again the lack of frame
overlap. A super-resolution detail of this mosaic is shown in figure 4.

The mosaic can then form the basis for a number of applications such as video
summary [?], motion removal, auto-calibration and super-resolution.
For example, figure 4 shows a super-resolution detail of the computed mosaic.
The method used [?] is based on MAP estimation, which gives a slight improvement
over the generally excellent Irani and Peleg algorithm [?].




Fig. 4. Super-resolution detail of the mosaic of figure 2. The super-resolution
image is built from a set of 20 images obtained from partially overlapping frames. The
original frames are JPEG compressed. (a) One of the set of images used as input for
the super-resolution computation. The image is 130 x 110 pixels, and has the highest
resolution of the 20 used. (b) The computed double resolution image (260 x 220 pixels).
Note the reduction in aliasing (e.g. on the dark bricks surrounding the Gothic arch)
and the improvement in sharpness of the edges of the brick drapery. Details of the
method are given in [?].



To summarize: in this case the “image matching relation” is a homography 
which is computed from point feature correspondences. Once the homography 
is estimated the correspondence of every pixel is determined. 

3 Structure and Motion Computation 

This section gives an example of (metric) reconstruction of the scene and cameras 
directly from an image sequence. This involves computing the cameras up to a 
scaled Euclidean transformation of 3-space (auto-calibration) and a dense model 
of the scene. The method proceeds in two overall steps: 

1. Compute cameras for all frames of the sequence. 

2. Compute a dense scene reconstruction with correspondence aided by the 
multiple view geometry. 

Unlike in the mosaicing example, in this case the camera centre is moving
and consequently the map between corresponding pixels depends on the depth
of the scene points, i.e. for general scene structure there is not a simple map
(such as a homography) global to the image.

3.1 Computing Cameras for an Image Sequence 

The method follows a similar path to that of computing a mosaic. 

1. Features: Compute interest point features in each image. 

2. Multiple view point correspondences: Compute two-view interest point
correspondences and simultaneously the fundamental matrix F between pairs
of frames, e.g. using robust estimation on minimal sets of 7 points, as described
in [?]. Compute three-view interest point correspondences and simultaneously
the trifocal tensor between image triplets, e.g. using robust
estimation on minimal sets of 6 points, as described in [?]. Weave together
these 2-view and 3-view reconstructions to get an initial estimation of 3D
points and cameras for all frames [?]. This initial reconstruction provides
the basis for bundle adjustment.

3. Optimal estimation: Compute the maximum likelihood estimate of the
3D points and cameras by minimizing reprojection error over all points.
This is bundle adjustment and determines a projective reconstruction. The
cost function is the sum of squared distances between the measured image
points $x^i_j$ and the projections of the estimated 3D points using the estimated
cameras:

$$\min_{P^i, X_j} \sum_{i,j} d(P^i X_j, x^i_j)^2 \qquad (2)$$

where $P^i$ is the estimated camera matrix for the ith view, $X_j$ is the jth
estimated 3D point, and d(x, y) is the geometric image distance between the
homogeneous points x and y.

4. Auto-calibration: Remove the projective ambiguity in the reconstruction
using constraints on the cameras such as constant aspect ratio, see e.g. [23].
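
The bundle adjustment cost of step 3 can be evaluated as in the sketch below. It only computes the sum of squared reprojection errors for given cameras and points; the minimization itself (for example with a Levenberg-Marquardt solver) is omitted. The camera matrices, scene points and the dictionary layout of the observations are assumptions made for the example.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with the 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def reprojection_cost(cameras, points3d, observations):
    """Sum of squared image distances over all (view i, point j) observations."""
    cost = 0.0
    for (i, j), x_meas in observations.items():
        r = project(cameras[i], points3d[j]) - np.asarray(x_meas)
        cost += float(r @ r)
    return cost

# Two hypothetical cameras separated by a small sideways translation.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[0.3], [0.0], [0.0]])])
cameras = [P0, P1]
points3d = np.array([[0.0, 0.0, 5.0], [1.0, -0.5, 6.0], [-0.8, 0.4, 4.0]])

# Noise-free observations give zero cost.
observations = {(i, j): project(P, X)
                for i, P in enumerate(cameras)
                for j, X in enumerate(points3d)}
cost_exact = reprojection_cost(cameras, points3d, observations)

# Displacing one measurement by (3, 4) pixels adds 3^2 + 4^2 to the cost.
observations[(0, 0)] = observations[(0, 0)] + np.array([3.0, 4.0])
cost_perturbed = reprojection_cost(cameras, points3d, observations)
print(cost_exact, cost_perturbed)
```

Only the points actually observed in a view contribute a term, so a point that is not visible in every frame is handled by leaving the corresponding key out of `observations`.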

Further details of automatic computation of cameras for a sequence are given
in [?].



3.2 Computing a Dense Reconstruction 

Given the cameras, the multi-view geometry is used to help solve for dense
correspondences. There is a large body of literature concerning methods for
obtaining surface depths given the camera geometry: a classical stereo algorithm
may be used, for example an area based algorithm such as [?]; or space carving,
e.g. [21, 28]; or surface primitives may be fitted directly, e.g. piecewise planar
models [?] or piecewise generalized cylinders [?]; or optical flow may be used,
constrained by epipolar geometry [?].
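
To illustrate how known camera geometry constrains dense correspondence, the sketch below computes the epipolar line of a pixel and the distance of candidate matches from it. A pure translation between the views is assumed so that the fundamental matrix has a simple closed form; the calibration, translation and function names are hypothetical.

```python
import numpy as np

def skew(v):
    """Return [v]_x, the matrix with skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_line(F, pt):
    """Line l' = F x in the second image for a pixel pt in the first."""
    return F @ np.array([pt[0], pt[1], 1.0])

def point_line_distance(line, pt):
    """Perpendicular distance from the pixel pt to the line (a, b, c)."""
    a, b, c = line
    return abs(a * pt[0] + b * pt[1] + c) / np.hypot(a, b)

# Hypothetical cameras K[I | 0] and K[I | t], for which F = [K t]_x.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
t = np.array([0.2, -0.1, 0.05])
F = skew(K @ t)

# A pixel in view 1 and the viewing ray through it.
x1 = np.array([400.0, 300.0])
ray = np.linalg.inv(K) @ np.array([x1[0], x1[1], 1.0])
line = epipolar_line(F, x1)

# Whatever the depth of the scene point, its image in view 2 lies on the line.
distances = []
for depth in (2.0, 5.0, 20.0):
    x2h = K @ (depth * ray + t)
    distances.append(point_line_distance(line, x2h[:2] / x2h[2]))

# A candidate displaced 5 pixels along the line normal would be rejected.
normal = line[:2] / np.hypot(line[0], line[1])
x2h = K @ (5.0 * ray + t)
off_line = x2h[:2] / x2h[2] + 5.0 * normal
off_distance = point_line_distance(line, off_line)
print(max(distances), off_distance)
```

The search for the match of each pixel is thereby reduced from the whole image to a narrow band around a line.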

Figure 5 shows an example of automatic camera recovery from five images,
followed by automatic dense stereo reconstruction using an area based algorithm.
The method is described in [22].

Fig. 5. Automatic generation of a texture mapped 3D model from an image
sequence. The input is a sequence of images acquired by a hand held camera. The
output is a 3D VRML model of the cameras and scene geometry. (a)-(c) Three of the
five input images. (d) and (e) Two views of a metric reconstruction computed from
interest points matched over the five images. The cameras are represented by pyramids
with apex at the computed camera centre. (f) and (g) Two views of the texture mapped
3D model computed from the original images and reconstructed cameras using an area
based stereo algorithm. Figures courtesy of Marc Pollefeys, Reinhard Koch, and Luc
Van Gool.

To summarize: the key point is that features are a convenient intermediate
step from input images to dense 3D reconstruction. In this case the “image
matching relations” are epipolar geometry, trifocal geometry, etc., which may be
computed from image interest point correspondences. This is equivalent to recovering
the cameras up to a common projective transformation of 3-space. After
computing the cameras, the features need not be used at all in the subsequent
dense scene reconstruction.

4 Comparison of the Feature Based and Direct Methods 

Within this section the two methods are contrasted. We highlight three aspects
of the structure and motion recovery problem: invariance, optimal estimation,
and the computational efficiency of the methods. We then list the current state of
the art.

4.1 The Importance of Invariance 

Features have a wide range of photometric invariance. For example, although thus 
far we have only discussed interest points, lines may be detected in an image as an 
intensity discontinuity (an “edge”). The invariance arises because a discontinuity 
is still detectable even under large changes in illumination conditions between 
two images. Features also have a wide range of geometric invariance - lines 
are invariant to projective transformations (a line is mapped to a line), and 
consequently line features may be detected under any projective transformation 
of the image. 

In the case of Harris interest points, the feature is detected at the local minima
of an autocorrelation function. These minima are also invariant to a wide range
of photometric and geometric image transformations, as has been demonstrated
by Schmid et al. [?]. Consequently, the detected interest point across a set of
images corresponds to the same 3D point.

This photometric and geometric invariance is perhaps the primary motivation 
for adopting a feature based approach. 

In direct approaches it is necessary to provide a photometric map between 
corresponding pixels, for example the map might be that the image intensities are 
constant (image brightness constraint), or that corresponding pixels are related 
by a monotonic function. If this map is incorrect, then erroneous correspondences 
between pixels will result. In contrast, in feature based methods a photometric
map is used only to guide interest point correspondence.

As an example, consider normalized cross correlation on neighbourhoods.
This measure is used in the algorithms included in this paper to rank matches.
Normalized cross-correlation is invariant to a local affine transformation of
intensities (scaling plus offset). If, in a particular imaging situation, the
cross-correlation measure is not invariant to the actual photometric map between the
images, then the ranking of the matches may be erroneous. However, the position
of the interest points is (largely) invariant to this photometric map. Thus with
the immunity to mismatches provided by robust estimation, the transformation
estimated from the interest points will still be correct. It is for this reason that
the estimated camera geometry is largely unaffected by errors in the model of
(or invariance to) the photometric map.

If normalized cross-correlation is used in a direct method, and is invariant to 
the actual photometric map, then in principle the correct pixel correspondence 
will be obtained. However, if it is not invariant to the actual photometric map, 
then direct methods may systematically degrade but feature based methods will 
not. 
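
The invariance property, and its limits, can be checked directly. In this sketch the patch is random data, and the gain, offset and the non-monotonic intensity map are arbitrary choices made for illustration.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two equally sized patches."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
patch = rng.random((7, 7))          # intensities in [0, 1)

# Gain plus offset: the score is unaffected.
score_affine = ncc(patch, 1.8 * patch + 0.3)

# Contrast reversal (as can occur between sensing modalities): the
# correlation is perfect in magnitude but has the opposite sign.
score_reversed = ncc(patch, 1.0 - patch)

# A non-monotonic intensity map: the score is no longer close to 1.
score_nonmonotonic = ncc(patch, (patch - 0.5) ** 2)

print(score_affine, score_reversed, score_nonmonotonic)
```

When the true photometric map lies outside the family to which the measure is invariant, a method that relies on the score at every pixel inherits the error, whereas a feature based method uses the score only to rank candidate matches.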

As a consequence, feature based methods are able to cope with severe viewing
and photometric distortion, and this has enabled wide baseline matching
algorithms to be developed. An example is given in figure 6.




Fig. 6. Wide baseline matching. Three images acquired from very different view- 
points with a hand-held camera. The trifocal tensor is estimated automatically from in- 
terest point matches and a global homography affinity score. Five of the matched points 
are shown together with their corresponding epipolar lines in the second and third im- 
ages. The epipolar geometry is determined from the estimated trifocal tensor. Original 
images courtesy of RobotVis INRIA Sophia Antipolis. The wide baseline method is
described in [?].



4.2 Optimal Estimation 

A significant advantage of the feature based approach is that it readily lends
itself to a bundle adjustment method over a long sequence, and this provides
a maximum likelihood estimate of the estimated quantities (homographies in
the mosaic example, cameras in the structure and motion example). This reveals
a key difference between the feature based and direct methods: in feature
based approaches the errors are uncorrelated between features so that statistical
independence is a valid assumption in estimation.

Consider the “least squares” cost functions that are typically used (e.g. (2)).
For this to be a valid maximum likelihood estimation two criteria must be satisfied:
first, each of the squares to be summed must be the log likelihood of that
error, and second, each must be conditionally independent of any of the other
errors. In the case of feature based methods the sum of squared errors that is
minimized is the distance between the backprojected reconstructed 3D point and its
measured correspondence in each image. There is evidence that these errors are
independent and distributed with zero mean in a Gaussian manner [15]. The
same cannot be said when using the brightness constraint equation to estimate
global motion models [?]. Because the quantities involved are image derivatives
obtained by smoothing the image, there is a large amount of conditional
dependence between the errors. It is not clear what effect this violation of the
conditions for maximum likelihood estimation might have, but it is possible that
the results produced may be biased.

To summarize: for direct methods it is not straightforward to write down
a practical likelihood function for all pixels. Modelling of noise and statistics
is much more complicated, and simple assumptions of independence are invalid.
Thus attempting a global minimization treating all the errors as if they were
uncorrelated may lead to a biased result.

4.3 Computational Efficiency and Convergence 

Consider computing the fundamental matrix from two views. Interest point
correspondences yield highly accurate camera locations at little computational cost.
If instead every pixel in the image is used to calculate the epipolar geometry, the
computational cost rises dramatically. Furthermore, the result would not be
improved, since the feature based estimate already uses exactly those pixels where
the correspondence is well established (the point features). Using every pixel means
introducing much noisy data, as correspondence simply cannot be determined in
homogeneous regions of the image either from the brightness constraint or from
cross correlation. Thus the introduction of such pixels could potentially introduce
more outliers, which in turn may cause incorrect convergence of the minimization.
To determine correspondence in these regions requires additional constraints such
as local smoothness.

To summarize: features can be thought of as a computational device to leapfrog
us to a solution using just the good (less noisy) data first, and then incorporating
the bad (more noisy) data once we are near a global minimum.

4.4 Scope and Applications 

Finally we list some of the current achievements of feature based structure from
motion schemes and ask how direct methods compare with this. A list of this
sort will of course date, but it is indicative of the implementation ease and
computational success of the two approaches.

Automatic estimation of the fundamental matrix and trifocal tensor. Point features
facilitate automatic estimation of the fundamental matrix and trifocal
tensor. There is a wide choice of algorithms available for interest points. These
algorithms are based on robust statistics, which means that they are robust to
effects such as occlusion and small independent motions in an otherwise rigid
scene.

The fundamental matrix cannot be estimated from normal flow alone. The
trifocal tensor can, but results comparable to the feature based algorithms
have yet to be demonstrated. Direct methods can include robustness to minor
occlusion and small independent motions. Although pyramid methods can be
deployed to cope with larger disparities, direct methods have met with only
limited success in wide baseline cases.

Application to image sequences. Features have enabled automatic computation
of cameras for extended video sequences over a very wide range of camera motions
and scenes. This includes auto-calibration of the camera. An example of
camera computation with auto-calibration for hundreds of frames is shown in
figure 7.

In contrast, direct methods have generally been restricted to scenes amenable
to a “plane plus parallax” approach, i.e. where the scene is dominated by a plane
so that homographies may be computed between images.

Features other than points. Although this paper has concentrated on interest
point features, other features such as lines and curves may also be used to compute
multi-view relations. For example, figure 8 shows a homography
computed automatically from an imaged planar outline between two views.



5 Conclusions 

It is often said (by the unlearned) that feature based methods only furnish a
sparse representation of the scene. This misses the point: feature based
methods are a way of initializing camera geometry/image matching relations so
that a dense reconstruction method can follow.

The extraction of features - the seeds of perception [?] - is an intermediate
step, a computational artifice that culls the useless data and affords the use of
powerful statistical techniques such as RANSAC and bundle adjustment.

The purpose of this paper has not been to argue against the use of direct
methods where appropriate (for instance in the mosaicing problem under small
image deformations). Rather it has been to suggest that for more general structure
and motion problems, the currently most successful way to proceed is via
the extraction of photometrically invariant features. The benefit is that just
a few high information features can be used to find the correct ball park of the
solution. Once this is found more information may be introduced, and a “direct”
method can be used to improve the result.

Fig. 7. Reconstruction from extended image sequences. (a) Six frames from 120
frames of a helicopter shot. (b) Automatically computed cameras and 3D points. The
cameras are shown for just the start and end frames for clarity, with the path between
them indicated by the black curve. The computation method is described in [?].

Fig. 8. Computing homographies using curve features. The homography between
the plane of the spanner in (a) and (b) is computed automatically from segments of the
outline curve. This curve is obtained using a Canny-like edge detector. Note the severe 
perspective distortion in (b). The mapped outline is shown in (c). The computation 
method is robust to partial occlusion and involves identifying projectively covariant 
points on the curve such as bi-tangents and inflections. Details are given in [?].

Discussion for Direct versus Features Session



This section contains the discussion following the special panel session comparing
direct and feature-based methods for motion analysis. The positions of the
panelists are given in the previous two papers, by Irani & Anandan [?], and by
Torr & Zisserman [?].

Discussion 

Harpreet Sawhney: I don’t think the issue is really feature based or direct 
methods. There are many intermediate situations between these two extremes. 
An example is Sarnoff’s VideoBrush. It can align multiple, distorted images with 
only 10-20% overlap between them. The initial step uses a direct method to 
determine translations which roughly align the images. This is based on a search 
over a correlation surface. Then, as in feature based methods, it bundle adjusts 
homographies to align all the images, but it does not suffer from the problem of 
feature based methods where the cost rises with the number of features. 
Andrew Zisserman: There’s certainly room for combining the methods and
VideoBrush is a good example. In general, though, a direct method is suitable for
estimating a global transformation by a restricted search over a small number
of unknown parameters, such as the translation and rotation in the VideoBrush
application. But if, for example, there is a very severe projective transformation
where it is necessary to estimate eight parameters, then a feature based method
is really needed.

Shmuel Peleg: As Harpreet Sawhney mentioned, feature based and direct 
methods are just two extremes. Take the example that Michal Irani showed, 
where a correlation surface is computed for every pixel. This means that each 
pixel can be thought of as a feature point. If you throw away the 90% of these 
pixels that don’t have a clear correlation maximum, then you have sparse fea- 
ture based matching. Or you can keep the entire correlation which is the direct 
method. 

Rick Szeliski: Regarding Andrew Zisserman’s point that you can’t easily do
something like the fundamental matrix with direct measurements: I’ve always
thought that the fundamental matrix is kind of a strange beast. It’s really only
a way to get the camera matrices. If you formulate the problem as plane plus 
parallax then what Michal Irani showed, and what we and the people at Sarnoff 
did back in 1994, is that the direct method will pop out the camera matrix fairly 
easily. Maybe you have to try a couple of hypotheses for the epipole but after 
that gradient descent is enough. I don’t see our inability to solve a two-frame 
algebraic problem as a severe impediment to finding all of the cameras and the 
structure.

Andrew Zisserman: I agree that we compute the fundamental matrix in order
to get the cameras and then the scene reconstruction. We would claim that, at
the moment, feature-based methods can deal with much more severe distortions
between the two views than direct methods. The reason is that direct methods
really have to model to some extent the photometric and geometric distortions
(such as the global homography in the plane plus parallax approach), while
feature based methods do not require such an accurate model.

Michal Irani: Why do you say that? I showed the multisensor fusion example, 
which doesn’t assume photometric invariance. The claim that direct methods 
require the constant brightness assumption is just a red herring. 

Philip Torr: To reply to Rick Szeliski’s point, there’s no guarantee that the 
direct method will converge, especially when the camera movement is large. 
On the other hand a feature based method can be used to get a good initial 
approximation to a solution, and then additional image information can be used 
after that. The main thesis is just that: initialize with a feature based method, 
and refine the results later with a direct one if necessary. Also, features are not 
only points: we could use lines or curves as well. 

Joss Knight: In both cases, if there are large differences it seems that you need 
to have some pre-knowledge to decide what image pre-processing technique is 
needed to get alignable images or patches. 

Michal Irani: The multi-sensor correlation method that I showed at the end is 
actually quite general. You can apply it to any kind of images. Assumptions like 
brightness constancy are less expensive, but they are not necessary. 

Philip Torr: Just to add to the controversy, I note that Jitendra Malik has 
always argued strongly against features, but when he was doing the car moni- 
toring project he ended up resorting to feature based methods. I’m interested in 
his experience. 

Jitendra Malik: From a scientific or aesthetic point of view, I don’t think 
there’s any comparison. The world consists of surfaces, and the visual world 
consists of the perception of surfaces, with occlusion, non-rigid motion, etc. 

For engineering purposes, if it so happens that there’s just a single moving 
camera and you can reduce your world to a collection of 20 or 30 points, that’s 
fine. But it’s engineering, not fundamental vision. Even then, when I look at 
Andrew Zisserman’s or Luc van Gool’s demos I always find a very large number of 
windows, doors and such like. That’s fine too, but there’s an empirical question: 
Suppose I picked lots of video tapes at random, how often would features give me 
a better initialization, and how often would direct be better? It’s an empirical 
question and one should ask it. 

P. Anandan: That reminds me that so far, we have been talking as if structure 
from motion was the only thing worth doing. But there are a lot of other things 
like tracking walking people and motion segmentation. Michael Black and others 
have had a fair bit of success extending models like the rigid affine one to deal 
with these situations, applying direct brightness-based methods to compute the 
parameters. Video is not like multisensor imagery - simple motion models are 
enough, and even if the brightness varies, simple preprocessing actually does the 
trick most of the time. 

One of the things which initiated this debate in my mind is that back in 
91-93 Keith Hanna published a couple of good papers on direct methods for 
doing structure from motion, which I suspect that most of the features people 
in the audience have not read. 

Philip Torr: On the other hand, Andrew Blake’s person tracking work is feature 
based. 

P. Anandan: Yes. All I’m saying is that the applicability of direct methods 
goes farther than you think. 

Harpreet Sawhney: Phil Torr suggested that feature-based methods should be 
used to initialize direct ones. Actually, I don’t think that I’ve ever done that. If 
you need to compute dense displacement fields between nearby frames, you have 
the choice of a purely 2D method, or one that applies 3D rigidity constraints like 
Keith Hanna’s. In my experience the 3D methods give you much more accurate 
flow fields. 

Abstract. This paper is a survey of the theory and methods of photogrammetric 
bundle adjustment, aimed at potential implementors in the computer vision commu- 
nity. Bundle adjustment is the problem of refining a visual reconstruction to produce 
jointly optimal structure and viewing parameter estimates. Topics covered include: 
the choice of cost function and robustness; numerical optimization including sparse 
Newton methods, linearly convergent approximations, updating and recursive meth- 
ods; gauge (datum) invariance; and quality control. The theory is developed for 
general robust cost functions rather than restricting attention to traditional nonlinear 
least squares. 

Keywords: Bundle Adjustment, Scene Reconstruction, Gauge Freedom, Sparse Ma- 
trices, Optimization. 



1 Introduction 

This paper is a survey of the theory and methods of bundle adjustment aimed at the computer 
vision community, and more especially at potential implementors who already know a little 
about bundle methods. Most of the results appeared long ago in the photogrammetry and 
geodesy literatures, but many seem to be little known in vision, where they are gradually 
being reinvented. By providing an accessible modern synthesis, we hope to forestall some 
of this duplication of effort, correct some common misconceptions, and speed progress in 
visual reconstruction by promoting interaction between the vision and photogrammetry 
communities. 

Bundle adjustment is the problem of refining a visual reconstruction to produce jointly 
optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates. 
Optimal means that the parameter estimates are found by minimizing some cost function 
that quantifies the model fitting error, and jointly that the solution is simultaneously optimal 
with respect to both structure and camera variations. The name refers to the ‘bundles’ 
of light rays leaving each 3D feature and converging on each camera centre, which are 
‘adjusted’ optimally with respect to both feature and camera positions. Equivalently, 
unlike independent model methods (which merge partial reconstructions without updating 
their internal structure), all of the structure and camera parameters are adjusted together 
‘in one bundle’. 

Bundle adjustment is really just a large sparse geometric parameter estimation problem, 
the parameters being the combined 3D feature coordinates, camera poses and calibrations. 
Almost everything that we will say can be applied to many similar estimation problems in 
vision, photogrammetry, industrial metrology, surveying and geodesy. Adjustment computations 
are a major common theme throughout the measurement sciences, and once the 
basic theory and methods are understood, they are easy to adapt to a wide variety of prob- 
lems. Adaptation is largely a matter of choosing a numerical optimization scheme that 
exploits the problem structure and sparsity. We will consider several such schemes below 
for bundle adjustment. 

Classically, bundle adjustment and similar adjustment computations are formulated 
as nonlinear least squares problems [19, 46, 100, 21, 22, 69, 5, 73, 109]. The cost function 
is assumed to be quadratic in the feature reprojection errors, and robustness is provided 
by explicit outlier screening. Although it is already very flexible, this model is not really 
general enough. Modern systems often use non-quadratic M-estimator-like distributional 
models to handle outliers more integrally, and many include additional penalties related to 
overfitting, model selection and system performance (priors, MDL). For this reason, we 
will not assume a least squares / quadratic cost model. Instead, the cost will be modelled 
as a sum of opaque contributions from the independent information sources (individual 
observations, prior distributions, overfitting penalties ...). The functional forms of these 
contributions and their dependence on fixed quantities such as observations will usually be 
left implicit. This allows many different types of robust and non-robust cost contributions to 
be incorporated, without unduly cluttering the notation or hiding essential model structure. 
It fits well with modern sparse optimization methods (cost contributions are usually sparse 
functions of the parameters) and object-centred software organization, and it avoids many 
tedious displays of chain-rule results. Implementors are assumed to be capable of choosing 
appropriate functions and calculating derivatives themselves. 
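
As a hedged illustration of this "sum of opaque contributions" view, the following Python sketch treats each information source as a callable that maps the state to a scalar cost. It is a toy under stated assumptions: each observation directly measures one state component, the Huber function stands in for a generic robust M-estimator, and the names `squared_error_term`, `huber_term` and `total_cost` are hypothetical, not taken from the paper.

```python
from typing import Callable, Sequence

# A cost contribution is any function from the state vector to a scalar.
CostTerm = Callable[[Sequence[float]], float]


def squared_error_term(index: int, observed: float, sigma: float) -> CostTerm:
    """Quadratic (least squares) contribution for one direct observation."""
    def term(x: Sequence[float]) -> float:
        r = (observed - x[index]) / sigma
        return 0.5 * r * r
    return term


def huber_term(index: int, observed: float, sigma: float, k: float = 1.5) -> CostTerm:
    """Robust contribution (assumed Huber form): quadratic core, linear tails."""
    def term(x: Sequence[float]) -> float:
        r = abs(observed - x[index]) / sigma
        return 0.5 * r * r if r <= k else k * (r - 0.5 * k)
    return term


def total_cost(terms: Sequence[CostTerm], x: Sequence[float]) -> float:
    """The total cost is the sum of the independent contributions."""
    return sum(term(x) for term in terms)
```

Each term here depends on a single state component, which mirrors the remark that cost contributions are usually sparse functions of the parameters. Robust and non-robust terms mix freely because the optimizer only ever sees the sum.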

One aim of this paper is to correct a number of misconceptions that seem to be common 
in the vision literature: 

• “Optimization / bundle adjustment is slow” : Such statements often appear in papers 
introducing yet another heuristic Structure from Motion (SFM) iteration. The claimed 
slowness is almost always due to the unthinking use of a general-purpose optimiza- 
tion routine that completely ignores the problem structure and sparseness. Real bundle 
routines are much more efficient than this, and usually considerably more efficient and 
flexible than the newly suggested method (§6, 7). That is why bundle adjustment remains 
the dominant structure refinement technique for real applications, after 40 years 
of research. 

• “Only linear algebra is required” : This is a recent variant of the above, presumably 
meant to imply that the new technique is especially simple. Virtually all iterative refinement 
techniques use only linear algebra, and bundle adjustment is simpler than many 
in that it only solves linear systems: it makes no use of eigen-decomposition or SVD, 
which are themselves complex iterative methods. 
• “Any sequence can be used”: Many vision workers seem to be very resistant to the idea 
that reconstruction problems should be planned in advance (§11), and results checked 
afterwards to verify their reliability (§10). System builders should at least be aware of the 
basic techniques for this, even if application constraints make it difficult to use them. 
The extraordinary extent to which weak geometry and lack of redundancy can mask 
gross errors is too seldom appreciated, cf. [34, 50, 30, 33]. 

• “Point P is reconstructed accurately” : In reconstruction, just as there are no absolute 
references for position, there are none for uncertainty. The 3D coordinate frame is 
itself uncertain, as it can only be located relative to uncertain reconstructed features or 
cameras. All other feature and camera uncertainties are expressed relative to the frame 
and inherit its uncertainty, so statements about them are meaningless until the frame 
and its uncertainty are specified. Covariances can look completely different in different 
frames, particularly in object-centred versus camera-centred ones. See §9. 

There is a tendency in vision to develop a profusion of ad hoc adjustment iterations. Why 
should you use bundle adjustment rather than one of these methods? 

• Flexibility: Bundle adjustment gracefully handles a very wide variety of different 3D 
feature and camera types (points, lines, curves, surfaces, exotic cameras), scene types 
(including dynamic and articulated models, scene constraints), information sources (2D 
features, intensities, 3D information, priors) and error models (including robust ones). 
It has no problems with missing data. 

• Accuracy : Bundle adjustment gives precise and easily interpreted results because it uses 
accurate statistical error models and supports a sound, well-developed quality control 
methodology. 

• Efficiency: Mature bundle algorithms are comparatively efficient even on very large 
problems. They use economical and rapidly convergent numerical methods and make 
near-optimal use of problem sparseness. 

In general, as computer vision reconstruction technology matures, we expect that bundle 
adjustment will predominate over alternative adjustment methods in much the same way as 
it has in photogrammetry. We see this as an inevitable consequence of a greater appreciation 
of optimization (notably, more effective use of problem structure and sparseness), and of 
systems issues such as quality control and network design. 

Coverage : We will touch on a good many aspects of bundle methods. We start by consid- 
ering the camera projection model and the parametrization of the bundle problem §2, and 
the choice of error metric or cost function §3. §4 gives a rapid sketch of the optimization 
theory we will use. §5 discusses the network structure (parameter interactions and char- 
acteristic sparseness) of the bundle problem. The following three sections consider three 
types of implementation strategies for adjustment computations: §6 covers second order 
Newton-like methods, which are still the most often used adjustment algorithms ; §7 covers 
methods with only first order convergence (most of the ad hoc methods are in this class) ; 
and §8 discusses solution updating strategies and recursive filtering bundle methods. §9 
returns to the theoretical issue of gauge freedom (datum deficiency), including the theory 
of inner constraints. §10 goes into some detail on quality control methods for monitoring 
the accuracy and reliability of the parameter estimates. §11 gives some brief hints on network 
design, i.e. how to place your shots to ensure accurate, reliable reconstruction. §12 
completes the body of the paper by summarizing the main conclusions and giving some 
provisional recommendations for methods. There are also several appendices. §A gives a 
brief historical overview of the development of bundle methods, with literature references. 
§B gives some technical details of matrix factorization, updating and covariance calculation 
methods. §C gives some hints on designing bundle software, and pointers to useful 
resources on the Internet. The paper ends with a glossary and references. 

General references: Cultural differences sometimes make it difficult for vision workers 
to read the photogrammetry literature. The collection edited by Atkinson [5] and the 
manual by Karara [69] are both relatively accessible introductions to close-range (rather 
than aerial) photogrammetry. Other accessible tutorial papers include [46,21,22]. Kraus 
[73] is probably the most widely used photogrammetry textbook. Brown’s early survey 
of bundle methods [19] is well worth reading. The often-cited manual edited by Slama 
[100] is now quite dated, although its presentation of bundle adjustment is still relevant. 
Wolf & Ghiliani [109] is a text devoted to adjustment computations, with an emphasis 
on surveying. Hartley & Zisserman [62] is an excellent recent textbook covering vision 
geometry from a computer vision viewpoint. For nonlinear optimization, Fletcher [29] 
and Gill et al [42] are the traditional texts, and Nocedal & Wright [93] is a good modern 
introduction. For linear least squares, Björck [11] is superlative, and Lawson & Hanson is 
a good older text. For more general numerical linear algebra, Golub & Van Loan [44] is 
the standard. Duff et al [26] and George & Liu [40] are the standard texts on sparse matrix 
techniques. We will not discuss initialization methods for bundle adjustment in detail, but 
appropriate reconstruction methods are plentiful and well-known in the vision community. 
See, e.g., [62] for references. 



Notation: The structure, cameras, etc., being estimated will be parametrized by a single 
large state vector $x$. In general the state belongs to a nonlinear manifold, but we linearize 
this locally and work with small linear state displacements denoted $\delta x$. Observations (e.g. 
measured image features) are denoted $\underline{z}$. The corresponding predicted values at parameter 
value $x$ are denoted $z = z(x)$, with residual prediction error $\Delta z(x) \equiv \underline{z} - z(x)$. However, 
observations and prediction errors usually only appear implicitly, through their influence 
on the cost function $f(x) = f(\mathrm{predz}(x))$. The cost function’s gradient is $g \equiv \frac{df}{dx}$ and 
its Hessian is $H \equiv \frac{d^2 f}{dx^2}$. The observation-state Jacobian is $J \equiv \frac{dz}{dx}$. The dimensions of 
$\delta x$, $\delta z$ are $n_x$, $n_z$. 



2 Projection Model and Problem Parametrization 

2.1 The Projection Model 

We begin the development of bundle adjustment by considering the basic image projection 
model and the issue of problem parametrization. Visual reconstruction attempts to recover a 
model of a 3D scene from multiple images. As part of this, it usually also recovers the poses 
(positions and orientations) of the cameras that took the images, and information about their 
internal parameters. A simple scene model might be a collection of isolated 3D features, 
e.g., points, lines, planes, curves, or surface patches. However, far more complicated scene 
models are possible, involving, e.g., complex objects linked by constraints or articulations, 
photometry as well as geometry, dynamics, etc. One of the great strengths of adjustment 
computations — and one reason for thinking that they have a considerable future in vision 
— is their ability to take such complex and heterogeneous models in their stride. Almost 
any predictive parametric model can be handled, i.e. any model that predicts the values 
of some known measurements or descriptors on the basis of some continuous parametric 
representation of the world, which is to be estimated from the measurements. 
Similarly, many possible camera models exist. Perspective projection is the standard, 
but the affine and orthographic projections are sometimes useful for distant cameras, and 
more exotic models such as push-broom and rational polynomial cameras are needed for 
certain applications [56,63]. In addition to pose (position and orientation), and simple 
internal parameters such as focal length and principal point, real cameras also require various 
types of additional parameters to model internal aberrations such as radial distortion 
[17-19, 100, 69, 5]. 

For simplicity, suppose that the scene is modelled by individual static 3D features $X_p$, 
$p = 1 \ldots n$, imaged in $m$ shots with camera pose and internal calibration parameters $P_i$, 
$i = 1 \ldots m$. There may also be further calibration parameters $C_c$, $c = 1 \ldots k$, constant 
across several images (e.g., depending on which of several cameras was used). We are 
given uncertain measurements $\underline{x}_{ip}$ of some subset of the possible image features $x_{ip}$ (the 
true image of feature $X_p$ in image $i$). For each observation $\underline{x}_{ip}$, we assume that we have 
a predictive model $x_{ip} = x(C_c, P_i, X_p)$ based on the parameters, that can be used to 
derive a feature prediction error: 

$$\Delta x_{ip}(C_c, P_i, X_p) \equiv \underline{x}_{ip} - x(C_c, P_i, X_p) \qquad (1)$$

In the case of image observations the predictive model is image projection, but other 
observation types such as 3D measurements can also be included. 
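
To make the prediction error of equation (1) concrete, here is a minimal Python sketch for the simplest case: a pinhole camera whose only calibration parameter is a focal length. The function names and the (R, t) pose convention are assumptions chosen for illustration; the paper deliberately leaves the projection model abstract.

```python
from typing import Sequence, Tuple

Vec = Sequence[float]


def project(focal: float, R: Sequence[Vec], t: Vec, X: Vec) -> Tuple[float, float]:
    """Pinhole prediction x(C, P, X) with calibration C = focal and pose P = (R, t)."""
    # Transform the 3D point into the camera frame: Xc = R X + t.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division, then scaling by the focal length.
    return (focal * Xc[0] / Xc[2], focal * Xc[1] / Xc[2])


def prediction_error(observed: Vec, focal: float, R: Sequence[Vec],
                     t: Vec, X: Vec) -> Tuple[float, float]:
    """Feature prediction error: measured image position minus predicted one."""
    u, v = project(focal, R, t, X)
    return (observed[0] - u, observed[1] - v)
```

A richer camera model (principal point, radial distortion, shared calibration blocks) would only change the body of `project`; the error has the same "observation minus prediction" form.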

To estimate the unknown 3D feature and camera parameters from the observations, 
and hence reconstruct the scene, we minimize some measure (discussed in §3) of their total 
prediction error. Bundle adjustment is the model refinement part of this, starting from given 
initial parameter estimates (e.g., from some approximate reconstruction method). Hence, 
it is essentially a matter of optimizing a complicated nonlinear cost function (the total 
prediction error) over a large nonlinear parameter space (the scene and camera parameters). 

We will not go into the analytical forms of the various possible feature and image 
projection models, as these do not affect the general structure of the adjustment network, 
and only tend to obscure its central simplicity. We simply stress that the bundle framework 
is flexible enough to handle almost any desired model. Indeed, there are so many different 
combinations of features, image projections and measurements, that it is best to regard 
them as black boxes, capable of giving measurement predictions based on their current 
parameters. (For optimization, first, and possibly second, derivatives with respect to the 
parameters are also needed). 
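
One way to read the "black box" remark is that the adjustment only needs two things from each model: a prediction and its derivatives. The sketch below, in Python, approximates the Jacobian of any prediction function by central differences. It is an illustrative assumption rather than the paper's method: real bundle codes normally use analytic derivatives, and a numerical Jacobian like this is mainly useful for checking them. The name `numerical_jacobian` is hypothetical.

```python
from typing import Callable, List, Sequence


def numerical_jacobian(predict: Callable[[Sequence[float]], Sequence[float]],
                       params: Sequence[float],
                       step: float = 1e-6) -> List[List[float]]:
    """Central-difference Jacobian of a black-box prediction function."""
    n_out = len(predict(params))
    J = [[0.0] * len(params) for _ in range(n_out)]
    for j in range(len(params)):
        # Perturb one parameter at a time in both directions.
        plus = list(params)
        minus = list(params)
        plus[j] += step
        minus[j] -= step
        f_plus = predict(plus)
        f_minus = predict(minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * step)
    return J
```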

For much of the paper we will take quite an abstract view of this situation, collecting the 
scene and camera parameters to be estimated into a large state vector $x$, and representing 
the cost (total fitting error) as an abstract function $f(x)$. The cost is really a function of 
the feature prediction errors $\Delta x_{ip} = \underline{x}_{ip} - x(C_c, P_i, X_p)$. But as the observations $\underline{x}_{ip}$ are 
constants during an adjustment calculation, we leave the cost’s dependence on them and 
on the projection model $x(\cdot)$ implicit, and display only its dependence on the parameters 
$x$ actually being adjusted. 



2.2 Bundle Parametrization 

The bundle adjustment parameter space is generally a high-dimensional nonlinear manifold 
— a large Cartesian product of projective 3D feature, 3D rotation, and camera calibration 
manifolds, perhaps with nonlinear constraints, etc. The state $x$ is not strictly speaking a 
vector, but rather a point in this space. Depending on how the entities that it contains are 




Bundle Adjustment — A Modern Synthesis 






Fig. 1. Vision geometry and its error model are essentially 
projective. Affine parametrization introduces an artificial 
singularity at projective infinity, which may cause numer- 
ical problems for distant features. 

represented, $x$ can be subject to various types of complications including singularities, 
internal constraints, and unwanted internal degrees of freedom. These arise because geo- 
metric entities like rotations, 3D lines and even projective points and planes, do not have 
simple global parametrizations. Their local parametrizations are nonlinear, with singular- 
ities that prevent them from covering the whole parameter space uniformly (e.g. the many 
variants on Euler angles for rotations, the singularity of affine point coordinates at infinity). 
And their global parametrizations either have constraints (e.g. quaternions with $\|q\|^2 = 1$), 
or unwanted internal degrees of freedom (e.g. homogeneous projective quantities have a 
scale factor freedom, two points defining a line can slide along the line). For more compli- 
cated compound entities such as matching tensors and assemblies of 3D features linked by 
coincidence, parallelism or orthogonality constraints, parametrization becomes even more 
delicate. 

Although they are in principle equivalent, different parametrizations often have pro- 
foundly different numerical behaviours which greatly affect the speed and reliability of the 
adjustment iteration. The most suitable parametrizations for optimization are as uniform, 
finite and well-behaved as possible near the current state estimate. Ideally, they should 
be locally close to linear in terms of their effect on the chosen error model, so that the 
cost function is locally nearly quadratic. Nonlinearity hinders convergence by reducing 
the accuracy of the second order cost model used to predict state updates (§6). Excessive 
correlations and parametrization singularities cause ill-conditioning and erratic numerical 
behaviour. Large or infinite parameter values can only be reached after excessively many 
finite adjustment steps. 

Any given parametrization will usually only be well-behaved in this sense over a rela- 
tively small section of state space. So to guarantee uniformly good performance, however 
the state itself may be represented, state updates should be evaluated using a stable local 
parametrization based on increments from the current estimate. As examples we consider 
3D points and rotations. 

3D points: Even for calibrated cameras, vision geometry and visual reconstructions are 
intrinsically projective. If a 3D $(X\ Y\ Z)^\top$ parametrization (or equivalently a homogeneous 
affine $(X\ Y\ Z\ 1)^\top$ one) is used for very distant 3D points, large $X$, $Y$, $Z$ displacements 
are needed to change the image significantly. I.e., in $(X\ Y\ Z)$ space the cost function 
becomes very flat and steps needed for cost adjustment become very large for distant 
points. In comparison, with a homogeneous projective parametrization $(X\ Y\ Z\ W)^\top$, the 
behaviour near infinity is natural, finite and well-conditioned so long as the normalization 
keeps the homogeneous 4-vector finite at infinity (by sending $W \to 0$ there). In fact, 
there is no immediate visual distinction between the images of real points near infinity 
and virtual ones ‘beyond’ it (all camera geometries admit such virtual points as bona fide 
projective constructs). The optimal reconstruction of a real 3D point may even be virtual 
in this sense, if image noise happens to push it ‘across infinity’. Also, there is nothing to 
stop a reconstructed point wandering beyond infinity and back during the optimization. 
This sounds bizarre at first, but it is an inescapable consequence of the fact that the nat- 
ural geometry and error model for visual reconstruction is projective rather than affine. 
Projectively, infinity is just like any other place. Affine parametrization $(X\ Y\ Z\ 1)^\top$ is 
acceptable for points near the origin with close-range convergent camera geometries, but 
it is disastrous for distant ones because it artificially cuts away half of the natural parameter 
space, and hides the fact by sending the resulting edge to infinite parameter values. Instead, 
you should use a homogeneous parametrization $(X\ Y\ Z\ W)^\top$ for distant points, e.g. with 
spherical normalization $\sum_i X_i^2 = 1$. 
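
The following Python sketch illustrates the homogeneous parametrization with spherical normalization. It assumes, purely for illustration, a camera with identity rotation at a given centre and a single focal length; `normalize` and `project_homogeneous` are hypothetical names. The point is that a point at infinity (W = 0) is projected by exactly the same finite arithmetic as any other point.

```python
import math
from typing import Sequence, Tuple


def normalize(Xh: Sequence[float]) -> Tuple[float, ...]:
    """Spherical normalization: rescale a homogeneous 4-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in Xh))
    return tuple(c / norm for c in Xh)


def project_homogeneous(focal: float, centre: Sequence[float],
                        Xh: Sequence[float]) -> Tuple[float, float]:
    """Project (X Y Z W) into an unrotated camera located at `centre`.

    The camera-frame direction is (X Y Z) - W * centre, which is defined
    up to scale and stays finite when W = 0 (a point at infinity).
    """
    X, Y, Z, W = Xh
    xc = X - W * centre[0]
    yc = Y - W * centre[1]
    zc = Z - W * centre[2]
    return (focal * xc / zc, focal * yc / zc)
```

Because the projection is invariant to the overall scale of the 4-vector, renormalizing after each update changes nothing in the image but keeps all four coordinates bounded, however distant the point is.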

Rotations: Similarly, experience suggests that quasi-global 3 parameter rotation parametrizations 
such as Euler angles cause numerical problems unless one can be certain to 
avoid their singularities and regions of uneven coverage. Rotations should be parametrized 
using either quaternions subject to $\|q\|^2 = 1$, or local perturbations $R\,\delta R$ or $\delta R\,R$ of 
an existing rotation $R$, where $\delta R$ can be any well-behaved 3 parameter small rotation 
approximation, e.g. $\delta R = (I + [\delta r]_\times)$, the Rodrigues formula, local Euler angles, etc. 
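
A local rotation perturbation of this kind can be sketched in Python as follows. The example assumes the right-multiplied form (R times a small rotation) and builds the small rotation exactly with the Rodrigues formula, so the result stays a rotation matrix. All function names are hypothetical, and plain nested lists are used only to keep the sketch dependency-free.

```python
import math
from typing import List, Sequence

Mat = List[List[float]]


def skew(r: Sequence[float]) -> Mat:
    """Cross-product matrix [r]x, so that [r]x v equals r x v."""
    return [[0.0, -r[2], r[1]],
            [r[2], 0.0, -r[0]],
            [-r[1], r[0], 0.0]]


def matmul(A: Mat, B: Mat) -> Mat:
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rodrigues(r: Sequence[float]) -> Mat:
    """Exact rotation matrix for the rotation vector r (Rodrigues formula)."""
    theta = math.sqrt(sum(c * c for c in r))
    K = skew(r)
    K2 = matmul(K, K)
    if theta < 1e-12:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def update_rotation(R: Mat, dr: Sequence[float]) -> Mat:
    """Local update: multiply R on the right by the small rotation from dr."""
    return matmul(R, rodrigues(dr))
```

The three components of `dr` are the unconstrained parameters the optimizer works with; they are reset to zero after each update, so the parametrization is always used near its best-behaved point.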

State updates: Just as state vectors $x$ represent points in some nonlinear space, state 
updates $x \to x + \delta x$ represent displacements in this nonlinear space that often can not 
be represented exactly by vector addition. Nevertheless, we assume that we can locally 
linearize the state manifold, locally resolving any internal constraints and freedoms that 
it may be subject to, to produce an unconstrained vector $\delta x$ parametrizing the possible 
local state displacements. We can then, e.g., use Taylor expansion in $\delta x$ to form a local 
cost model $f(x + \delta x) \approx f(x) + \frac{df}{dx}\,\delta x + \frac{1}{2}\,\delta x^\top \frac{d^2 f}{dx^2}\,\delta x$, from which we can estimate the 
state update $\delta x$ that optimizes this model (§4). The displacement $\delta x$ need not have the 
same structure or representation as $x$; indeed, if a well-behaved local parametrization is 
used to represent $\delta x$, it generally will not have. But we must at least be able to update 
the state with the displacement to produce a new state estimate. We write this operation 
as $x \to x + \delta x$, even though it may involve considerably more than vector addition. For 
example, apart from the change of representation, an updated quaternion $q \to q + \delta q$ will 
need to have its normalization $\|q\|^2 = 1$ corrected, and a small rotation update of the form 
$R \to R(I + [r]_\times)$ will not in general give an exact rotation matrix. 
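
The quaternion case shows why the update is more than vector addition. The short Python sketch below adds the displacement and then restores the unit-norm constraint. It is one simple way of correcting the normalization, chosen here as an assumption; `update_quaternion` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def update_quaternion(q: Sequence[float], dq: Sequence[float]) -> Tuple[float, ...]:
    """Add the displacement dq to q, then renormalize to unit length."""
    raw = [a + b for a, b in zip(q, dq)]
    norm = math.sqrt(sum(c * c for c in raw))
    return tuple(c / norm for c in raw)
```

Without the final division the quaternion would drift off the unit sphere a little at every iteration and would stop representing a pure rotation.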

3 Error Modelling 

We now turn to the choice of the cost function f(x), which quantifies the total prediction 
(image reprojection) error of the model parametrized by the combined scene and camera 
parameters $x$. Our main conclusion will be that robust statistically-based error metrics 
based on total (inlier + outlier) log likelihoods should be used, to correctly allow for the 
presence of outliers. We will argue this at some length as it seems to be poorly understood. 
The traditional treatments of adjustment methods consider only least squares (albeit with 
data trimming for robustness), and most discussions of robust statistics give the impression 
that the choice of robustifier or M-estimator is wholly a matter of personal whim rather 
than data statistics. 

Bundle adjustment is essentially a parameter estimation problem. Any parameter es- 
timation paradigm could be used, but we will consider only optimal point estimators, 
whose output is by definition the single parameter vector that minimizes a predefined cost 
function designed to measure how well the model fits the observations and background 
knowledge. This framework covers many practical estimators including maximum likeli- 
hood (ML) and maximum a posteriori (MAP), but not explicit Bayesian model averaging. 
Robustification, regularization and model selection terms are easily incorporated in the 
cost. 
A typical ML cost function would be the summed negative log likelihoods of the 
prediction errors of all the observed image features. For Gaussian error distributions, 
this reduces to the sum of squared covariance-weighted prediction errors (§3.2). A MAP 
estimator would typically add cost terms giving certain structure or camera calibration 
parameters a bias towards their expected values. 
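
The Gaussian reduction can be checked numerically. The Python sketch below assumes independent scalar residuals with known standard deviations (a diagonal covariance), which is a simplification of the general covariance-weighted case; both function names are hypothetical. The negative log likelihood and the weighted squared error differ only by a constant that does not depend on the residuals, so they have the same minimizer.

```python
import math
from typing import Sequence


def gaussian_nll(residual: float, sigma: float) -> float:
    """Negative log likelihood of one scalar residual under N(0, sigma^2)."""
    return 0.5 * (residual / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))


def weighted_squared_error(residuals: Sequence[float], sigmas: Sequence[float]) -> float:
    """Half the sum of squared residuals, each weighted by its inverse variance."""
    return 0.5 * sum((r / s) ** 2 for r, s in zip(residuals, sigmas))
```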

The cost function is also a tool for statistical interpretation. To the extent that lower 
costs are uniformly ‘better’, it provides a natural model preference ordering, so that cost 
iso-surfaces above the minimum define natural confidence regions. Locally, these regions 
are nested ellipsoids centred on the cost minimum, with size and shape characterized by 
the dispersion matrix (the inverse of the cost function Hessian $H = \frac{d^2 f}{dx^2}$ at the minimum). 
Also, the residual cost at the minimum can be used as a test statistic for model validity 
(§10). E.g., for a negative log likelihood cost model with Gaussian error distributions, 
twice the residual is a $\chi^2$ variable. 
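
As a small worked example of the dispersion matrix and the residual test statistic, consider the simplest possible adjustment (an illustrative assumption, not a case treated in the paper): estimating a single scalar $x$ from $n$ independent measurements $z_i$, each with Gaussian noise of standard deviation $\sigma$.

```latex
f(x) = \frac{1}{2} \sum_{i=1}^{n} \frac{(z_i - x)^2}{\sigma^2},
\qquad
H = \frac{d^2 f}{dx^2} = \frac{n}{\sigma^2},
\qquad
H^{-1} = \frac{\sigma^2}{n},
\qquad
2 f(\hat{x}) = \sum_{i=1}^{n} \frac{(z_i - \hat{x})^2}{\sigma^2} \sim \chi^2_{n-1}
```

The minimum $\hat{x}$ is the sample mean, the dispersion $\sigma^2 / n$ is exactly its variance, and twice the residual cost follows a $\chi^2$ distribution with $n - 1$ degrees of freedom (observations minus estimated parameters).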

3.1 Desiderata for the Cost Function 

In adjustment computations we go to considerable lengths to optimize a large nonlinear cost 
model, so it seems reasonable to require that the refinement should actually improve the 
estimates in some objective (albeit statistical) sense. Heuristically motivated cost functions 
can not usually guarantee this. They almost always lead to biased parameter estimates, and 
often severely biased ones. A large body of statistical theory points to maximum likelihood 
(ML) and its Bayesian cousin maximum a posteriori (MAP) as the estimators of choice. 
ML simply selects the model for which the total probability of the observed data is highest, 
or saying the same thing in different words, for which the total posterior probability of the 
model given the observations is highest. MAP adds a prior term representing background 
information. ML could just as easily have included the prior as an additional ‘observation’ : 
so far as estimation is concerned, the distinction between ML / MAP and prior / observation 
is purely terminological. 

Information usually comes from many independent sources. In bundle adjustment 
these include: covariance-weighted reprojection errors of individual image features; other 
measurements such as 3D positions of control points, GPS or inertial sensor readings; 
predictions from uncertain dynamical models (for ‘Kalman filtering’ of dynamic cameras 
or scenes); prior knowledge expressed as soft constraints (e.g. on camera calibration or 
pose values); and supplementary sources such as overfitting, regularization or description 
length penalties. Note the variety. One of the great strengths of adjustment computations is 
their ability to combine information from disparate sources. Assuming that the sources are 
statistically independent of one another given the model, the total probability for the model 
given the combined data is the product of the probabilities from the individual sources. To 
get an additive cost function we take logs, so the total log likelihood for the model given 
the combined data is the sum of the individual source log likelihoods. 
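
Written out, with $p_s$ denoting the probability contributed by source $s$ and $f_s = -\log p_s$ its cost (the symbols $p_s$ and $f_s$ are notation assumed here for illustration):

```latex
p_{\mathrm{total}} = \prod_{s} p_s
\quad\Longrightarrow\quad
f = -\log p_{\mathrm{total}} = \sum_{s} \left( -\log p_s \right) = \sum_{s} f_s
```

Maximizing the total probability is therefore the same as minimizing the summed cost, and adding a new independent source just adds one more term to the sum.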

Properties of ML estimators : Apart from their obvious simplicity and intuitive appeal, 
ML and MAP estimators have strong statistical properties. Many of the most notable ones 
are asymptotic, i.e. they apply in the limit of a large number of independent measurements, 
or more precisely in the central limit where the posterior distribution becomes effectively 
Gaussian¹. In particular: 

¹ Cost is additive, so as measurements of the same type are added the entire cost surface grows in 
direct proportion to the amount of data $n_z$. This means that the relative sizes of the cost and all of 
• Under mild regularity conditions on the observation distributions, the posterior distribution 
of the ML estimate converges asymptotically in probability to a Gaussian with 
covariance equal to the dispersion matrix. 

• The ML estimate asymptotically has zero bias and the lowest variance that any unbiased 
estimator can have. So in this sense, ML estimation is at least as good as any other 
method². 

Non-asymptotically, the dispersion is not necessarily a good approximation for the 
covariance of the ML estimator. The asymptotic limit is usually assumed to be valid 
for well-designed highly-redundant photogrammetric measurement networks, but recent 
sampling-based empirical studies of posterior likelihood surfaces [35, 80, 68] suggest that 
the case is much less clear for small vision geometry problems and weaker networks. More 
work is needed on this. 

The effect of incorrect error models : It is clear that incorrect modelling of the observation 
distributions is likely to disturb the ML estimate. Such mismodelling is to some extent 
inevitable because error distributions stand for influences that we can not fully predict or 
control. To understand the distortions that unrealistic error models can cause, first realize 
that geometric fitting is really a special case of parametric probability density estimation. 
For each set of parameter values, the geometric image projection model and the assumed 
observation error models combine to predict a probability density for the observations. 
Maximizing the likelihood corresponds to fitting this predicted observation density to the 
observed data. The geometry and camera model only enter indirectly, via their influence 
on the predicted distributions. 

Accurate noise modelling is just as critical to successful estimation as accurate ge- 
ometric modelling. The most important mismodelling is failure to take account of the 
possibility of outliers (aberrant data values, caused e.g., by blunders such as incorrect 
feature correspondences). We stress that so long as the assumed error distributions model 
the behaviour of all of the data used in the fit (including both inliers and outliers), the 
above properties of ML estimation including asymptotic minimum variance remain valid 
in the presence of outliers. In other words, ML estimation is naturally robust : there is no 

its derivatives — and hence the size r of the region around the minimum over which the second 
order Taylor terms dominate all higher order ones — remain roughly constant as riz increases. 
Within this region, the total cost is roughly quadratic, so if the cost function was taken to be the 
posterior log likelihood, the posterior distribution is roughly Gaussian. However the curvature of 
the quadratic (i.e. the inverse dispersion matrix) increases as data is added, so the posterior standard 
deviation shrinks as 0{a — rix), where 0{a) characterizes the average standard deviation 

from a single observation. For riz — Ux S> {a jr'f' , essentially the entire posterior probability 
mass lies inside the quadratic region, so the posterior distribution converges asymptotically in 
probability to a Gaussian. This happens at any proper isolated cost minimum at which second 
order Taylor expansion is locally valid. The approximation gets better with more data (stronger 
curvature) and smaller higher order Taylor terms. 

^ This result follows from the Cramer-Rao bound (e.g. [23]), which says that the covariance of any 
unbiased estimator is bounded below by the Fisher information or mean curvature of the posterior 

log likelihood surface ((x — x)(x — x)^) ^ ') "'here pis the posterior probability, X the 

parameters being estimated, X the estimate given by any unbiased estimator, X the tme underlying 
X value, and A L B denotes positive semidefiniteness of A — B. Asymptotically, the posterior 
distribution becomes Gaussian and the Fisher information converges to the inverse dispersion (the 
curvature of the posterior log likelihood surface at the cost minimum), so the ML estimate attains 
the Cramer-Rao bound. 
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1000 Samples from a Cauchy and a Gaussian Distribution 




Fig. 2. Beware of treating any bell-shaped observation distribution as a Gaussian. Despite being 
narrower in the peak and broader in the tails, the probability density function of a Cauchy distribution, 
p{x) = (7t( 1 -I- x^)) , does not look so very different from that of a Gaussian {top left). But their 
negative log likelihoods are very different (bottom left), and large deviations (“outliers”) are much 
more probable for Cauchy variates than for Gaussian ones (right). In fact, the Cauchy distribution 
has infinite covariance. 

need to robustify it so long as realistic error distributions were used in the first place. A 
distribution that models both inliers and outliers is called a total distribution. There is no 
need to separate the two classes, as ML estimation does not care about the distinction. If the 
total distribution happens to be an explicit mixture of an inlier and an outlier distribution 
(e.g., a Gaussian with a locally uniform background of outliers), outliers can be labeled 
after fitting using likelihood ratio tests, but this is in no way essential to the estimation 
process. 
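As a rough illustration of labelling outliers after the fit, the following Python sketch applies a
likelihood ratio test to scalar residuals. The mixture form (a Gaussian inlier peak plus a uniform
background), the function names and all numerical values (`sigma`, `outlier_fraction`,
`background_width`, `threshold`) are illustrative assumptions, not values taken from the text.

```python
import math

def inlier_likelihood(r, sigma, outlier_fraction):
    """Weighted Gaussian inlier component of the total density at residual r."""
    gauss = math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return (1.0 - outlier_fraction) * gauss

def outlier_likelihood(outlier_fraction, background_width):
    """Weighted uniform outlier component, constant over the background interval."""
    return outlier_fraction / background_width

def is_outlier(r, sigma=1.0, outlier_fraction=0.1, background_width=100.0, threshold=1.0):
    """Likelihood ratio test: outlier when inlier/outlier likelihood falls below threshold."""
    p_in = inlier_likelihood(r, sigma, outlier_fraction)
    p_out = outlier_likelihood(outlier_fraction, background_width)
    return p_in < threshold * p_out

residuals = [0.3, -1.1, 0.8, 6.0, -0.2, -9.5]   # hypothetical residuals left after a fit
labels = [is_outlier(r) for r in residuals]
```

The labels are computed only after the estimate is available, so they play no part in the fit itself.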

It is also important to realize the extent to which superficially similar distributions can
differ from a Gaussian, or equivalently, how extraordinarily rapidly the tails of a Gaussian
distribution fall away compared to more realistic models of real observation errors. See
figure 2. In fact, unmodelled outliers typically have very severe effects on the fit. To see this,
suppose that the real observations are drawn from a fixed (but perhaps unknown) underlying
distribution $p_0(z)$. The law of large numbers says that their empirical distributions (the
observed distribution of each set of samples) converge asymptotically in probability to $p_0(z)$.
So for each model, the negative log likelihood cost sum $-\sum_i \log p_{model}(z_i | x)$ converges
to $-n_z \int p_0(z) \log(p_{model}(z | x)) \, dz$. Up to a model-independent constant, this is $n_z$ times
the relative entropy or Kullback-Leibler divergence $\int p_0(z) \log(p_0(z) / p_{model}(z | x)) \, dz$
of the model distribution w.r.t. the true one $p_0(z)$. Hence, even if the model family does
not include $p_0$, the ML estimate converges asymptotically to the model whose predicted
observation distribution has minimum relative entropy w.r.t. $p_0$ (see, e.g. [96, proposition
2.2]). It follows that ML estimates are typically very sensitive to unmodelled outliers, as
regions which are relatively probable under $p_0$ but highly improbable under the model
make large contributions to the relative entropy. In contrast, allowing for outliers where
none actually occur causes relatively little distortion, as no region which is probable under
$p_0$ will have large $-\log p_{model}$.
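The sensitivity to unmodelled outliers can be seen in a toy location problem. The sketch below
fits a single scalar location `mu` to ten samples, once under a pure Gaussian model and once under
a model with a constant outlier floor. The data, the floor value `eps` and the brute-force grid
search are assumptions made purely for illustration.

```python
import math

def gaussian_nll(mu, data, sigma=1.0):
    """Negative log likelihood (up to a constant) of a pure Gaussian model."""
    return sum(0.5 * ((z - mu) / sigma) ** 2 for z in data)

def total_nll(mu, data, sigma=1.0, eps=0.01):
    """Negative log likelihood of a Gaussian peak plus a constant outlier floor eps."""
    return sum(-math.log(math.exp(-0.5 * ((z - mu) / sigma) ** 2) + eps) for z in data)

def grid_minimize(cost, data, lo=-5.0, hi=25.0, steps=3001):
    """Brute-force 1D minimization on a uniform grid (adequate for a sketch)."""
    best_mu, best_cost = lo, float("inf")
    for k in range(steps):
        mu = lo + (hi - lo) * k / (steps - 1)
        c = cost(mu, data)
        if c < best_cost:
            best_mu, best_cost = mu, c
    return best_mu

# Nine inliers scattered symmetrically around 0 and one gross outlier at 20.
data = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.1, 0.3, 0.5, 0.8, 20.0]
mu_gauss = grid_minimize(gaussian_nll, data)   # pulled towards the outlier
mu_total = grid_minimize(total_nll, data)      # stays with the inliers
```

Under the Gaussian model the single outlier drags the estimate to the sample mean, whereas the
model that allows for outliers leaves the estimate essentially at the inlier location.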

In summary, if there is a possibility of outliers, non-robust distribution models such
as Gaussians should be replaced with more realistic long-tailed ones such as: mixtures of
a narrow ‘inlier’ and a wide ‘outlier’ density, Cauchy or α-densities, or densities defined
piecewise with a central peaked ‘inlier’ region surrounded by a constant ‘outlier’ region^3.
We emphasize again that poor robustness is due entirely to unrealistic distributional
assumptions: the maximum likelihood framework itself is naturally robust provided that the
total observation distribution including both inliers and outliers is modelled. In fact, real
observations can seldom be cleanly divided into inliers and outliers. There is a hard core
of outliers such as feature correspondence errors, but there is also a grey area of features
that for some reason (a specularity, a shadow, poor focus, motion blur . . . ) were not as
accurately located as other features, without clearly being outliers.



3.2 Nonlinear Least Squares 

One of the most basic parameter estimation methods is nonlinear least squares. Suppose
that we have vectors of observations $\underline{z}_i$ predicted by a model $z_i = z_i(x)$, where $x$ is a
vector of model parameters. Then nonlinear least squares takes as estimates the parameter
values that minimize the weighted Sum of Squared Error (SSE) cost function:

$f(x) \equiv \frac{1}{2} \sum_i \Delta z_i(x)^\top W_i \, \Delta z_i(x), \qquad \Delta z_i(x) \equiv \underline{z}_i - z_i(x)$    (2)

Here, $\Delta z_i(x)$ is the feature prediction error and $W_i$ is an arbitrary symmetric positive
definite (SPD) weight matrix. Modulo normalization terms independent of $x$, the weighted
SSE cost function coincides with the negative log likelihood for observations $\underline{z}_i$ perturbed
by Gaussian noise of mean zero and covariance $W_i^{-1}$. So for least squares to have a useful
statistical interpretation, the $W_i$ should be chosen to approximate the inverse measurement
covariance of $\underline{z}_i$. Even for non-Gaussian noise with this mean and covariance, the
Gauss-Markov theorem [37, 11] states that if the models $z_i(x)$ are linear, least squares gives the
Best Linear Unbiased Estimator (BLUE), where ‘best’ means minimum variance^4.

Any weighted least squares model can be converted to an unweighted one ($W_i = 1$)
by pre-multiplying $z_i$, $\underline{z}_i$, $\Delta z_i$ by any $L_i^\top$ satisfying $W_i = L_i L_i^\top$. Such an $L_i$ can be
calculated efficiently from $W_i$ or $W_i^{-1}$ using Cholesky decomposition (§B.1).
$\widetilde{\Delta z}_i = L_i^\top \Delta z_i$ is called a standardized residual, and the resulting unweighted least squares
problem $\min_x \frac{1}{2} \sum_i \| \widetilde{\Delta z}_i(x) \|^2$ is said to be in standard form. One advantage of this is that
optimization methods based on linear least squares solvers can be used in place of ones based
on linear (normal) equation solvers, which allows ill-conditioned problems to be handled
more stably (§B.2).
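The conversion to standard form can be checked numerically on a single 2D observation. This is
a minimal sketch in plain Python; the weight matrix `W`, the residual `dz` and the helper names
are hypothetical, and a hand-written 2x2 Cholesky factorization stands in for a library routine.

```python
import math

def cholesky_2x2(w):
    """Lower-triangular L with w = L L^T for a symmetric positive definite 2x2 matrix."""
    (a, b), (_, c) = w
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def standardize(l, dz):
    """Standardized residual L^T dz for a lower-triangular 2x2 factor L."""
    return [l[0][0] * dz[0] + l[1][0] * dz[1], l[1][1] * dz[1]]

def weighted_sq_error(w, dz):
    """Quadratic form dz^T W dz."""
    return sum(dz[i] * w[i][j] * dz[j] for i in range(2) for j in range(2))

W = [[4.0, 1.0], [1.0, 2.0]]   # hypothetical inverse covariance of one image point
dz = [0.5, -1.5]               # hypothetical prediction error in pixels
L = cholesky_2x2(W)
dz_std = standardize(L, dz)
unweighted = sum(v * v for v in dz_std)   # squared norm of the standardized residual
```

The squared norm of the standardized residual reproduces the weighted squared error, which is
what allows an unweighted least squares solver to be used on the transformed problem.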

Another peculiarity of the SSE cost function is its indifference to the natural boundaries
between the observations. If observations $\underline{z}_i$ from any sources are assembled into a
compound observation vector $\underline{z} \equiv (\underline{z}_1^\top, \ldots, \underline{z}_k^\top)^\top$, and their weight matrices are assembled
into a compound block diagonal weight matrix $W \equiv \mathrm{diag}(W_1, \ldots, W_k)$, the weighted
squared error $f(x) \equiv \frac{1}{2} \Delta z(x)^\top W \Delta z(x)$ is the same as the original SSE cost function,
$\frac{1}{2} \sum_i \Delta z_i(x)^\top W_i \Delta z_i(x)$. The general quadratic form of the SSE cost is preserved under
such compounding, and also under arbitrary linear transformations of $\underline{z}$ that mix components
from different observations. The only place that the underlying structure is visible
is in the block structure of $W$. Such invariances do not hold for essentially any other cost
function, but they simplify the formulation of least squares considerably.

^3 The latter case corresponds to a hard inlier / outlier decision rule: for any observation in the ‘outlier’
region, the density is constant so the observation has no influence at all on the fit. Similarly, the
mixture case corresponds to a softer inlier / outlier decision rule.

^4 It may be possible (and even useful) to do better with either biased (towards the correct solution),
or nonlinear estimators.

3.3 Robustified Least Squares

The main problem with least squares is its high sensitivity to outliers. This happens because 
the Gaussian has extremely small tails compared to most real measurement error distributions.
For robust estimates, we must choose a more realistic likelihood model (§3.1). The
exact functional form is less important than the general way in which the expected types
of outliers enter. A single blunder such as a correspondence error may affect one or a few 
of the observations, but it will usually leave all of the others unchanged. This locality is 
the whole basis of robustification. If we can decide which observations were affected, we 
can down-weight or eliminate them and use the remaining observations for the parameter 
estimates as usual. If all of the observations had been affected about equally (e.g. as by 
an incorrect projection model), we might still know that something was wrong, but not be 
able to fix it by simple data cleaning. 

We will adopt a ‘single layer’ robustness model, in which the observations are partitioned
into independent groups $\underline{z}_i$, each group being irreducible in the sense that it is
accepted, down-weighted or rejected as a whole, independently of all the other groups.
The partitions should reflect the types of blunders that occur. For example, if feature
correspondence errors are the most common blunders, the two coordinates of a single image
point would naturally form a group as both would usually be invalidated by such a blunder, 
while no other image point would be affected. Even if one of the coordinates appeared to 
be correct, if the other were incorrect we would usually want to discard both for safety. 
On the other hand, in stereo problems, the four coordinates of each pair of corresponding 
image points might be a more natural grouping, as a point in one image is useless without 
its correspondent in the other one. 

Henceforth, when we say observation we mean irreducible group of observations 
treated as a unit by the robustifying model. I.e., our observations need not be scalars, but 
they must be units, probabilistically independent of one another irrespective of whether 
they are inliers or outliers. 

As usual, each independent observation $\underline{z}_i$ contributes an independent term $f_i(x \,|\, \underline{z}_i)$ to
the total cost function. This could have more or less any form, depending on the expected
total distribution of inliers and outliers for the observation. One very natural family is the
radial distributions, which have negative log likelihoods of the form:

$f_i(x) \equiv \frac{1}{2} \rho_i( \Delta z_i(x)^\top W_i \, \Delta z_i(x) )$    (3)

Here, $\rho_i(s)$ can be any increasing function with $\rho_i(0) = 0$ and $\frac{d}{ds} \rho_i(0) = 1$. (These
guarantee that at $\Delta z_i = 0$, $f_i$ vanishes and $\frac{d^2 f_i}{dz_i^2} = W_i$). Weighted SSE has $\rho_i(s) = s$, while
more robust variants have sublinear $\rho_i$, often tending to a constant at $\infty$ so that distant
outliers are entirely ignored. The dispersion matrix $W_i^{-1}$ determines the spatial spread of $\underline{z}_i$,
and up to scale its covariance (if this is finite). The radial form is preserved under arbitrary
affine transformations of $\underline{z}_i$, so within a group, all of the observations are on an equal
footing in the same sense as in least squares. However, non-Gaussian radial distributions
are almost never separable: the observations in $\underline{z}_i$ can neither be split into independent
subgroups, nor combined into larger groups, without destroying the radial form. Radial
cost models do not have the remarkable isotropy of non-robust SSE, but this is exactly
what we wanted, as it ensures that all observations in a group will be either left alone, or
down-weighted together.

As an example of this, for image features polluted with occasional large outliers caused
by correspondence errors, we might model the error distribution as a Gaussian central peak
plus a uniform background of outliers. This would give negative log likelihood contributions
of the form $f(x) = -\sum_{ip} \log( \exp(-\frac{1}{2} \chi^2_{ip}) + \epsilon )$ instead of the non-robust weighted
SSE model $f(x) = \frac{1}{2} \sum_{ip} \chi^2_{ip}$, where $\chi^2_{ip} = \Delta x_{ip}^\top W_{ip} \, \Delta x_{ip}$ is the squared weighted residual
error (which is a $\chi^2$ variable for a correct model and Gaussian error distribution), and $\epsilon$
parametrizes the frequency of outliers.
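The behaviour of this robust contribution is easy to tabulate. The sketch below compares it with
the non-robust SSE term for a single observation; the function names and the value `eps = 0.01`
are illustrative assumptions.

```python
import math

def sse_cost(chi2):
    """Non-robust contribution: half the squared weighted residual."""
    return 0.5 * chi2

def robust_cost(chi2, eps=0.01):
    """Gaussian peak plus uniform background: -log(exp(-chi2 / 2) + eps)."""
    return -math.log(math.exp(-0.5 * chi2) + eps)

# Tabulate both costs for increasingly large squared residuals.
for chi2 in (0.0, 1.0, 4.0, 25.0, 100.0):
    print(f"chi2={chi2:6.1f}  sse={sse_cost(chi2):7.2f}  robust={robust_cost(chi2):6.3f}")
```

For small residuals the two costs differ only by a near-constant offset, while for large ones the
robust cost saturates at $-\log \epsilon$, so a distant outlier exerts almost no pull on the fit.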




3.4 Intensity-Based Methods 

The above models apply not only to geometric image features, but also to intensity-based
matching of image patches. In this case, the observables are image gray-scales or colors
$I$ rather than feature coordinates $u$, and the error model is based on intensity residuals.
To get from a point projection model $u = u(x)$ to an intensity based one, we simply
compose with the assumed local intensity model $I = I(u)$ (e.g. obtained from an image
template or another image that we are matching against), premultiply point Jacobians by
point-to-intensity Jacobians $\frac{dI}{du}$, etc. The full range of intensity models can be implemented
within this framework: pure translation, affine, quadratic or homographic patch deformation
models, 3D model based intensity predictions, coupled affine or spline patches for
surface coverage, etc., [1, 52, 55, 9, 110, 94, 53, 97, 76, 104, 102]. The structure of intensity
based bundle problems is very similar to that of feature based ones, so all of the techniques
studied below can be applied.

We will not go into more detail on intensity matching, except to note that it is the
real basis of feature based methods. Feature detectors are optimized for detection not
localization. To localize a detected feature accurately we need to match (some function of)
the image intensities in its region against either an idealized template or another image of
the feature, using an appropriate geometric deformation model, etc. For example, suppose
that the intensity matching model is $f(u) = \frac{1}{2} \iint \rho( \| \delta I(u) \|^2 )$ where the integration is
over some image patch, $\delta I$ is the current intensity prediction error, $u$ parametrizes the local
geometry (patch translation & warping), and $\rho(\cdot)$ is some intensity error robustifier. Then
the cost gradient in terms of $u$ is $g_u^\top = \frac{df}{du} = \iint \rho' \, \delta I \, \frac{dI}{du}$, and the cost Hessian in
$u$ in a Gauss-Newton approximation is $H_u = \frac{d^2 f}{du^2} \approx \iint \rho' \, (\frac{dI}{du})^\top \frac{dI}{du}$. In a feature based
model, we express $u = u(x)$ as a function of the bundle parameters, so if $J_u = \frac{du}{dx}$ we have
a corresponding cost gradient and Hessian contribution $g_x^\top = g_u^\top J_u$ and $H_x = J_u^\top H_u J_u$.
In other words, the intensity matching model is locally equivalent to a quadratic feature
matching one on the ‘features’ $u(x)$, with effective weight (inverse covariance) matrix
$W_u = H_u$. All image feature error models in vision are ultimately based on such an
underlying intensity matching model. As feature covariances are a function of intensity
gradients $\iint \rho' \, (\frac{dI}{du})^\top \frac{dI}{du}$, they can be both highly variable between features (depending
on how much local gradient there is), and highly anisotropic (depending on how directional
the gradients are). E.g., for points along a 1D intensity edge, the uncertainty is large in the
along edge direction and small in the across edge one.
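The anisotropy of the effective weight matrix can be illustrated with a discrete patch, replacing
the integral by a sum over pixels. In this sketch the gradient samples are invented, the
robustifier derivative is fixed at `rho_prime = 1`, and the function name is hypothetical.

```python
def patch_information(gradients, rho_prime=1.0):
    """Sum of rho' * grad^T grad over a patch: a 2x2 Gauss-Newton Hessian H_u."""
    hxx = hxy = hyy = 0.0
    for gx, gy in gradients:
        hxx += rho_prime * gx * gx
        hxy += rho_prime * gx * gy
        hyy += rho_prime * gy * gy
    return [[hxx, hxy], [hxy, hyy]]

# Hypothetical patch on a vertical edge: strong horizontal gradient, almost none vertically.
edge_gradients = [(10.0, 0.1), (9.0, -0.2), (11.0, 0.0), (10.5, 0.1)]
# Hypothetical corner-like patch: gradients in both directions.
corner_gradients = [(10.0, 0.0), (0.0, 9.0), (7.0, 7.0), (-6.0, 8.0)]

H_edge = patch_information(edge_gradients)
H_corner = patch_information(corner_gradients)
```

For the edge patch the weight across the edge is thousands of times larger than the weight along
it, so the implied covariance is strongly elongated along the edge. The corner patch gives
comparable weights in both directions.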

3.5 Implicit Models 

Sometimes observations are most naturally expressed in terms of an implicit
observation-constraining model $h(x, z) = 0$, rather than an explicit observation-predicting one
$z = z(x)$. (The associated image error still has the form $f(\underline{z} - z)$). For example, if the model
is a 3D curve and we observe points on it (the noisy images of 3D points that may lie
anywhere along the 3D curve), we can predict the whole image curve, but not the exact
position of each observation along it. We only have the constraint that the noiseless image
of the observed point would lie on the noiseless image of the curve, if we knew these. There
are basically two ways to handle implicit models: nuisance parameters and reduction.

Nuisance parameters: In this approach, the model is made explicit by adding additional
‘nuisance’ parameters representing something equivalent to model-consistent estimates
of the unknown noise free observations, i.e. to $z$ with $h(x, z) = 0$. The most direct way
to do this is to include the entire parameter vector $z$ as nuisance parameters, so that we
have to solve a constrained optimization problem on the extended parameter space $(x, z)$,
minimizing $f(\underline{z} - z)$ over $(x, z)$ subject to $h(x, z) = 0$. This is a sparse constrained
problem, which can be solved efficiently using sparse matrix techniques (§6.3). In fact,
for image observations, the subproblems in $z$ (optimizing $f(\underline{z} - z)$ over $z$ for fixed $\underline{z}$
and $x$) are small and for typical $f$ rather simple. So in spite of the extra parameters $z$,
optimizing this model is not significantly more expensive than optimizing an explicit one
$z = z(x)$ [14, 13, 105, 106]. For example, when estimating matching constraints between
image pairs or triplets [60, 62], instead of using an explicit 3D representation, pairs or
triplets of corresponding image points can be used as features $z_i$, subject to the epipolar
or trifocal geometry contained in $x$ [105, 106].

However, if a smaller nuisance parameter vector than $z$ can be found, it is wise to use
it. In the case of a curve, it suffices to include just one nuisance parameter per observation,
saying where along the curve the corresponding noise free observation is predicted to
lie. This model exactly satisfies the constraints, so it converts the implicit model to an
unconstrained explicit one $z = z(x, \lambda)$, where $\lambda$ are the along-curve nuisance parameters.

The advantage of the nuisance parameter approach is that it gives the exact optimal
parameter estimate for $x$, and jointly, optimal $x$-consistent estimates for the noise free
observations $z$.

Reduction: Alternatively, we can regard $h(x, \underline{z})$ rather than $\underline{z}$ as the observation vector,
and hence fit the parameters to the explicit log likelihood model for $h(x, \underline{z})$. To do this,
we must transfer the underlying error model / distribution $f(\Delta z)$ on $\underline{z}$ to one $f(h)$ on
$h(x, \underline{z})$. In principle, this should be done by marginalization: the density for $h$ is given
by integrating that for $\Delta z$ over all $\Delta z$ giving the same $h$. Within the point estimation
framework, it can be approximated by replacing the integration with maximization. Neither
calculation is easy in general, but in the asymptotic limit where first order Taylor expansion
$h(x, \underline{z}) = h(x, z + \Delta z) \approx 0 + \frac{dh}{dz} \Delta z$ is valid, the distribution of $h$ is a marginalization or
maximization of that of $\Delta z$ over affine subspaces. This can be evaluated in closed form for
some robust distributions. Also, standard covariance propagation gives (more precisely,
this applies to the $h$ and $\Delta z$ dispersions):

$\langle h(x, \underline{z}) \rangle \approx 0, \qquad \langle h(x, \underline{z}) \, h(x, \underline{z})^\top \rangle \approx \frac{dh}{dz} \langle \Delta z \, \Delta z^\top \rangle \frac{dh}{dz}^\top = \frac{dh}{dz} W^{-1} \frac{dh}{dz}^\top$    (4)

where $W^{-1}$ is the covariance of $\Delta z$. So at least for an outlier-free Gaussian model, the
reduced distribution remains Gaussian (albeit with $x$-dependent covariance).
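As a small worked instance of equation (4), consider fitting a circle, with the implicit constraint
that an observed 2D point lies on it. The sketch below propagates the point covariance through
the constraint to first order and evaluates the resulting Gaussian cost for one observation. The
circle model, the numbers and the function names are illustrative assumptions, and the
normalization term arising from the parameter-dependent variance is ignored.

```python
def circle_constraint(center, radius, z):
    """Implicit model h(x, z) = |z - c|^2 - r^2, zero when z lies on the circle."""
    dx, dy = z[0] - center[0], z[1] - center[1]
    return dx * dx + dy * dy - radius * radius

def constraint_jacobian(center, z):
    """dh/dz as a row vector: 2 (z - c)^T."""
    return [2.0 * (z[0] - center[0]), 2.0 * (z[1] - center[1])]

def reduced_variance(jac, cov):
    """First order propagation: (dh/dz) Cov (dh/dz)^T for a scalar constraint."""
    return sum(jac[i] * cov[i][j] * jac[j] for i in range(2) for j in range(2))

def reduced_cost(center, radius, z, cov):
    """Gaussian negative log likelihood of h, up to normalization: h^2 / (2 var)."""
    h = circle_constraint(center, radius, z)
    var = reduced_variance(constraint_jacobian(center, z), cov)
    return 0.5 * h * h / var

center, radius = (0.0, 0.0), 5.0
cov = [[0.25, 0.0], [0.0, 0.25]]   # isotropic point noise, sigma = 0.5 pixel
z_obs = (5.1, 0.0)                 # observed 0.1 pixel outside the circle
cost = reduced_cost(center, radius, z_obs, cov)
```

Here the reduced cost comes out close to $\frac{1}{2} (0.1 / 0.5)^2 = 0.02$, the value that the exact
point-to-circle distance would give, as expected when the first order expansion is accurate.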



4 Basic Numerical Optimization 

Having chosen a suitable model quality metric, we must optimize it. This section gives a 
very rapid sketch of the basic local optimization methods for differentiable functions. See 
[29, 93, 42] for more details. We need to minimize a cost function $f(x)$ over parameters $x$,
starting from some given initial estimate $x$ of the minimum, presumably supplied by some
approximate visual reconstruction method or prior knowledge of the approximate situation.
As in §2.2, the parameter space may be nonlinear, but we assume that local displacements
can be parametrized by a local coordinate system / vector of free parameters $\delta x$. We try
to find a displacement $x \to x + \delta x$ that locally minimizes or at least reduces the cost
function. Real cost functions are too complicated to minimize in closed form, so instead
we minimize an approximate local model for the function, e.g. based on Taylor expansion
or some other approximation at the current point $x$. Although this does not usually give the
exact minimum, with luck it will improve on the initial parameter estimate and allow us to
iterate to convergence. The art of reliable optimization is largely in the details that make
this happen even without luck: which local model, how to minimize it, how to ensure that
the estimate is improved, and how to decide when convergence has occurred. If you are not
interested in such subjects, use a professionally designed package (§C.2): details are
important here.

4.1 Second Order Methods 

The reference for all local models is the quadratic Taylor series one:

$f(x + \delta x) \approx f(x) + g^\top \delta x + \frac{1}{2} \delta x^\top H \, \delta x, \qquad g \equiv \frac{df}{dx}(x), \qquad H \equiv \frac{d^2 f}{dx^2}(x)$    (5)

(the three expressions being respectively the quadratic local model, the gradient vector and the
Hessian matrix).

For now, assume that the Hessian $H$ is positive definite (but see below and §9). The local
model is then a simple quadratic with a unique global minimum, which can be found
explicitly using linear algebra. Setting $\frac{df}{dx}(x + \delta x) \approx H \, \delta x + g$ to zero for the stationary
point gives the Newton step:

$\delta x = -H^{-1} g$    (6)

The estimated new function value is $f(x + \delta x) \approx f(x) - \frac{1}{2} \delta x^\top H \, \delta x = f(x) - \frac{1}{2} g^\top H^{-1} g$.

Iterating the Newton step gives Newton’s method. This is the canonical optimization 
method for smooth cost functions, owing to its exceptionally rapid theoretical and practical 
convergence near the minimum. For quadratic functions it converges in one iteration, and 
for more general analytic ones its asymptotic convergence is quadratic: as soon as the 
estimate gets close enough to the solution for the second order Taylor expansion to be 
reasonably accurate, the residual state error is approximately squared at each iteration. 
This means that the number of significant digits in the estimate approximately doubles at 
each iteration, so starting from any reasonable estimate, at most about $\log_2(16) + 1 \approx 5$-6
iterations are needed for full double precision (16 digit) accuracy. Methods that potentially 
achieve such rapid asymptotic convergence are called second order methods. This is a 
high accolade for a local optimization method, but it can only be achieved if the Newton step 
is asymptotically well approximated. Despite their conceptual simplicity and asymptotic 
performance, Newton-like methods have some disadvantages : 

• To guarantee convergence, a suitable step control policy must be added (§4.2). 

• Solving the $n \times n$ Newton step equations takes time $O(n^3)$ for a dense system (§B.1),
which can be prohibitive for large n. Although the cost can often be reduced (very 
substantially for bundle adjustment) by exploiting sparseness in H, it remains true that 
Newton-like methods tend to have a high cost per iteration, which increases relative to 
that of other methods as the problem size increases. For this reason, it is sometimes 
worthwhile to consider more approximate first order methods (§7), which are occa- 
sionally more efficient, and generally simpler to implement, than sparse Newton-like 
methods. 

• Calculating second derivatives $H$ is by no means trivial for a complicated cost function,
both computationally, and in terms of implementation effort. The Gauss-Newton
method (§4.3) offers a simple analytic approximation to $H$ for nonlinear least squares
problems. Some other methods that build up approximations to $H$ from the way the gradient
$g$ changes during the iteration are also in use (see §7.1, Krylov methods).

• The asymptotic convergence of Newton-like methods is sometimes felt to be an expen- 
sive luxury when far from the minimum, especially when damping (see below) is active. 
However, it must be said that Newton-like methods generally do require significantly 
fewer iterations than first order ones, even far from the minimum. 
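The rapid asymptotic convergence of Newton's method is easy to observe on a scalar problem.
The following sketch minimizes the illustrative cost $f(x) = e^x - 2x$, whose minimum is at
$x = \log 2$; the cost function, starting point and function names are assumptions chosen only
to make the error sequence easy to inspect.

```python
import math

def newton_minimize(grad, hess, x, iterations):
    """Plain Newton iteration x <- x - g / H for a scalar cost; returns all iterates."""
    trace = [x]
    for _ in range(iterations):
        x = x - grad(x) / hess(x)
        trace.append(x)
    return trace

def grad(x):
    """Gradient of the illustrative cost f(x) = exp(x) - 2 x."""
    return math.exp(x) - 2.0

def hess(x):
    """Second derivative of the same cost (always positive)."""
    return math.exp(x)

trace = newton_minimize(grad, hess, x=1.0, iterations=6)
errors = [abs(x - math.log(2.0)) for x in trace]
for k, e in enumerate(errors):
    print(k, e)
```

Once the iterate is close to the minimum, each printed error is roughly proportional to the square
of the previous one, so the number of correct digits about doubles per iteration.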

4.2 Step Control 

Unfortunately, Newton’s method can fail in several ways. It may converge to a saddle 
point rather than a minimum, and for large steps the second order cost prediction may be 
inaccurate, so there is no guarantee that the true cost will actually decrease. To guarantee 
convergence to a minimum, the step must follow a local descent direction (a direction 
with a non-negligible component down the local cost gradient, or if the gradient is zero
near a saddle point, down a negative curvature direction of the Hessian), and it must make
reasonable progress in this direction (neither so little that the optimization runs slowly 
or stalls, nor so much that it greatly overshoots the cost minimum along this direction). 
It is also necessary to decide when the iteration has converged, and perhaps to limit any 
over-large steps that are requested. Together, these topics form the delicate subject of step 
control. 

To choose a descent direction, one can take the Newton step direction if this descends
(it may not near a saddle point), or more generally some combination of the Newton and
gradient directions. Damped Newton methods solve a regularized system to find the step:

$(H + \lambda W) \, \delta x = -g$    (7)

Here, $\lambda$ is some weighting factor and $W$ is some positive definite weight matrix (often
the identity, so $\lambda \to \infty$ becomes gradient descent $\delta x \propto -g$). $\lambda$ can be chosen to limit
the step to a dynamically chosen maximum size (trust region methods), or manipulated
more heuristically, to shorten the step if the prediction is poor (Levenberg-Marquardt
methods).

Given a descent direction, progress along it is usually assured by a line search method,
of which there are many based on quadratic and cubic 1D cost models. If the suggested
(e.g. Newton) step is $\delta x$, line search finds the $\alpha$ that actually minimizes $f$ along the line
$x + \alpha \, \delta x$, rather than simply taking the estimate $\alpha = 1$.

There is no space for further details on step control here (again, see [29, 93, 42]). How- 
ever note that poor step control can make a huge difference in reliability and convergence 
rates, especially for ill-conditioned problems. Unless you are familiar with these issues, it 
is advisable to use professionally designed methods. 
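To make the need for step control concrete, the sketch below applies undamped and damped Newton
iterations to the scalar cost $f(x) = \sqrt{1 + x^2}$, for which the pure Newton step maps $x$ to
$-x^3$ and so diverges from any start with $|x| > 1$. The damping rule used (divide $\lambda$ by 10
after a successful step, multiply it by 10 after a failed one) is one simple heuristic in the
Levenberg-Marquardt style, chosen here as an assumption. It is not a recommended production policy.

```python
import math

def cost(x):
    return math.sqrt(1.0 + x * x)

def gradient(x):
    return x / math.sqrt(1.0 + x * x)

def hessian(x):
    return (1.0 + x * x) ** -1.5

def pure_newton(x, iterations):
    """Undamped Newton steps; for this cost each step maps x to -x**3."""
    for _ in range(iterations):
        x = x - gradient(x) / hessian(x)
    return x

def damped_newton(x, lam=1e-3, max_iterations=100, tol=1e-8):
    """Solve (H + lam) dx = -g and accept the step only if the cost decreases."""
    for _ in range(max_iterations):
        g = gradient(x)
        if abs(g) < tol:
            break
        dx = -g / (hessian(x) + lam)
        if cost(x + dx) < cost(x):
            x, lam = x + dx, lam / 10.0   # good step: trust the quadratic model more
        else:
            lam *= 10.0                   # poor step: damp more strongly and retry
    return x

x_pure = pure_newton(2.0, iterations=4)   # grows without bound
x_damped = damped_newton(2.0)             # converges to the minimum at 0
```

Starting from the same point, the undamped iteration runs away while the damped one rejects the
over-long steps and then converges rapidly once it is inside the region where the quadratic model
is accurate.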



4.3 Gauss-Newton and Least Squares 

Consider the nonlinear weighted SSE cost model $f(x) \equiv \frac{1}{2} \Delta z(x)^\top W \Delta z(x)$ (§3.2) with
prediction error $\Delta z(x) \equiv z(x) - \underline{z}$ (predicted minus observed, the sign for which the
derivative formulae below hold) and weight matrix $W$. Differentiation gives the gradient
and Hessian in terms of the Jacobian or design matrix of the predictive model, $J \equiv \frac{dz}{dx}$:

$g \equiv \frac{df}{dx} = \Delta z^\top W J, \qquad H \equiv \frac{d^2 f}{dx^2} = J^\top W J + \sum_i (\Delta z^\top W)_i \frac{d^2 z_i}{dx^2}$    (8)

These formulae could be used directly in a damped Newton method, but the $\frac{d^2 z_i}{dx^2}$ term in $H$
is likely to be small in comparison to the corresponding components of $J^\top W J$ if either: (i)
the prediction error $\Delta z(x)$ is small; or (ii) the model is nearly linear, $\frac{d^2 z_i}{dx^2} \approx 0$. Dropping
the second term gives the Gauss-Newton approximation to the least squares Hessian,
$H \approx J^\top W J$. With this approximation, the Newton step prediction equations become the
Gauss-Newton or normal equations:

$(J^\top W J) \, \delta x = -J^\top W \Delta z$    (9)

The Gauss-Newton approximation is extremely common in nonlinear least squares, and
practically all current bundle implementations use it. Its main advantage is simplicity: the
second derivatives of the projection model $z(x)$ are complex and troublesome to implement.
In fact, the normal equations are just one of many methods of solving the weighted
linear least squares problem^5 $\min_{\delta x} \frac{1}{2} (J \delta x + \Delta z)^\top W (J \delta x + \Delta z)$, written here with the
sign of $\Delta z$ for which (9) is its stationarity condition. Another notable
method is that based on QR decomposition (§B.2, [11, 44]), which is up to a factor of two
slower than the normal equations, but much less sensitive to ill-conditioning in $J$ (see footnote 6).

Whichever solution method is used, the main disadvantage of the Gauss-Newton ap- 
proximation is that when the discarded terms are not negligible, the convergence rate is 
greatly reduced (§7.2). In our experience, such reductions are indeed common in highly 
nonlinear problems with (at the current step) large residuals. For example, near a saddle 
point the Gauss-Newton approximation is never accurate, as its predicted Hessian is al- 
ways at least positive semidefinite. However, for well-parametrized (i.e. locally near linear, 
§2.2) bundle problems under an outlier-free least squares cost model evaluated near the cost 
minimum, the Gauss-Newton approximation is usually very accurate. Feature extraction 
errors and hence $\Delta z$ and $W^{-1/2}$ have characteristic scales of at most a few pixels. In contrast,
the nonlinearities of z(x) are caused by nonlinear 3D feature-camera geometry (perspec- 
tive effects) and nonlinear image projection (lens distortion). For typical geometries and 
lenses, neither effect varies significantly on a scale of a few pixels. So the nonlinear correc- 
tions are usually small compared to the leading order linear terms, and bundle adjustment 
behaves as a near-linear small residual problem. 
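The Gauss-Newton iteration of equation (9) can be sketched on a one-parameter curve fit. The
example below fits $z(t) = \exp(x t)$ to noise-free samples with unit weights. The model, the
data and the function name are invented for illustration, and the residual is taken as prediction
minus observation, which is the sign convention assumed by this sketch.

```python
import math

def gauss_newton_exponent(times, observations, x, iterations=20):
    """Fit z(t) = exp(x t) by Gauss-Newton with unit weights (W = 1)."""
    for _ in range(iterations):
        # Residuals taken as prediction minus observation (a convention of this sketch).
        residuals = [math.exp(x * t) - z for t, z in zip(times, observations)]
        jacobian = [t * math.exp(x * t) for t in times]         # dz/dx per observation
        jtj = sum(j * j for j in jacobian)                       # J^T W J
        jtr = sum(j * r for j, r in zip(jacobian, residuals))    # J^T W dz
        x -= jtr / jtj                                           # normal equations step
    return x

times = [0.0, 0.5, 1.0, 1.5, 2.0]
x_true = 0.5
observations = [math.exp(x_true * t) for t in times]   # noise free, so zero residual
x_fit = gauss_newton_exponent(times, observations, x=0.0)
```

Because the residuals vanish at the solution, the discarded second derivative term vanishes there
too, and the iteration converges quickly even though no second derivatives of the model are used.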

However note that this does not extend to robust cost models. Robustification works 
by introducing strong nonlinearity into the cost function at the scale of typical feature 
reprojection errors. For accurate step prediction, the optimization routine must take account 
of this. For radial cost functions (§3.3), a reasonable compromise is to take account of
the exact second order derivatives of the robustifiers $\rho_i(\cdot)$, while retaining only the first
order Gauss-Newton approximation for the predicted observations $z_i(x)$. If $\rho'$ and $\rho''$ are
respectively the first and second derivatives of $\rho_i$ at the current evaluation point, we have
a robustified Gauss-Newton approximation:

$g_i = \rho' \, J_i^\top W_i \, \Delta z_i, \qquad H_i \approx J_i^\top ( \rho' \, W_i + 2 \rho'' \, (W_i \Delta z_i) (W_i \Delta z_i)^\top ) J_i$    (10)

So robustification has two effects : (i) it down-weights the entire observation (both g^ and 
Hi) by p' ; and (ii) it makes a rank-one reduction^ of the curvature in the radial (Az^) 
direction, to account for the way in which the weight changes with the residual. There 
are reweighting-based optimization methods that include only the first effect. They still 
find the true cost minimum g = 0 as the g^ are evaluated exactly*, but convergence may 

^ Here, the dependence of J on X is ignored, which amounts to the same thing as ignoring the 
term in H. 

® The QR method gives the solution to a relative error of about 0{Ct), as compared to 0{C^e) 
for the normal equations, where C is the condition number (the ratio of the largest to the smallest 
singular value) of J, and e is the machine precision (10“^® for double precision floating point). 

’ The useful robustifiers pi are sublinear, with p) < 1 and p” < 0 in the outlier region. 

⁸ Reweighting is also sometimes used in vision to handle projective homogeneous scale factors
rather than error weighting. E.g., suppose that image points $(u/w, v/w)^\top$ are generated by a
homogeneous projection equation $(u, v, w)^\top = P\,(X, Y, Z, 1)^\top$, where P is the 3×4 homogeneous
image projection matrix. A scale factor reweighting scheme might take derivatives w.r.t.
u, v while treating the inverse weight w as a constant within each iteration. Minimizing the
resulting globally bilinear linear least squares error model over P and $(X, Y, Z)^\top$ does not give
the true cost minimum: it zeros the gradient-ignoring-w-variations, not the true cost gradient.
Such schemes should not be used for precise work as the bias can be substantial, especially for
wide-angle lenses and close geometries.
be slowed owing to inaccuracy of H, especially for the mainly radial deviations produced
by non-robust initializers containing outliers. $H_i$ has a direction of negative curvature if
$\rho_i''\, \Delta z_i^\top W_i\, \Delta z_i < -\tfrac{1}{2} \rho_i'$, but if not we can even reduce the robustified Gauss-Newton
model to a local unweighted SSE one for which linear least squares methods can be used.
For simplicity suppose that $W_i$ has already been reduced to 1 by premultiplying $z_i$ and $J_i$ by $L_i^\top$,
where $L_i L_i^\top = W_i$. Then minimizing the effective squared error $\tfrac{1}{2} \| \delta\bar z_i - \bar J_i\, \delta x \|^2$ gives
the correct second order robust state update, where $\alpha \equiv \mathrm{RootOf}\left(\tfrac{1}{2}\alpha^2 - \alpha - \rho_i''/\rho_i'\, \|\Delta z_i\|^2\right)$
and:



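$$ \delta\bar z_i \equiv \frac{\sqrt{\rho_i'}}{1 - \alpha}\, \Delta z_i(x), \qquad \bar J_i \equiv \sqrt{\rho_i'} \left( 1 - \alpha\, \frac{\Delta z_i\, \Delta z_i^\top}{\|\Delta z_i\|^2} \right) J_i \tag{11} $$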



In practice, if $\rho_i''\, \|\Delta z_i\|^2 \lesssim -\tfrac{1}{2} \rho_i'$, we can use the same formulae but limit $\alpha \le 1 - \epsilon$ for
some small $\epsilon$. However, the full curvature correction is not applied in this case.
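The robustified terms (10) and the reduction (11) can be checked numerically. The following is a minimal sketch, assuming a cost of the form $\tfrac{1}{2}\rho(\Delta z^\top W \Delta z)$, an illustrative Cauchy-style robustifier $\rho(s) = \log(1 + s)$, and hypothetical function names; it is not taken from any particular bundle adjustment code.

```python
import numpy as np

def cauchy(s):
    """Illustrative robustifier rho(s) = log(1 + s); returns (rho, rho', rho'')."""
    return np.log1p(s), 1.0 / (1.0 + s), -1.0 / (1.0 + s) ** 2

def robust_gn_terms(J, W, dz, rho=cauchy):
    """Per-observation gradient and Hessian of equation (10)."""
    Wdz = W @ dz
    _, r1, r2 = rho(dz @ Wdz)
    g = r1 * (J.T @ Wdz)
    H = J.T @ (r1 * W + 2.0 * r2 * np.outer(Wdz, Wdz)) @ J
    return g, H

def reduce_to_sse(J, dz, rho=cauchy, eps=1e-3):
    """Equation (11), assuming W has already been reduced to the identity."""
    s = dz @ dz
    _, r1, r2 = rho(s)
    # Root of alpha^2/2 - alpha - (rho''/rho') s = 0 that lies below 1,
    # clamped to 1 - eps when the curvature would otherwise turn negative.
    disc = max(1.0 + 2.0 * r2 * s / r1, 0.0)
    alpha = min(1.0 - np.sqrt(disc), 1.0 - eps)
    dz_bar = np.sqrt(r1) / (1.0 - alpha) * dz
    J_bar = np.sqrt(r1) * (J - alpha * np.outer(dz, dz @ J) / s)
    return dz_bar, J_bar

rng = np.random.default_rng(0)
J = rng.normal(size=(2, 3))
dz = np.array([0.3, -0.2])               # small residual, no clamping needed
g, H = robust_gn_terms(J, np.eye(2), dz)
dz_bar, J_bar = reduce_to_sse(J, dz)
print(np.allclose(J_bar.T @ J_bar, H), np.allclose(J_bar.T @ dz_bar, g))
```

When the clamp is inactive, the normal-equation matrix and vector of the reduced problem reproduce $H_i$ and $g_i$ exactly, which is what allows ordinary linear least squares routines to be reused.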



4.4 Constrained Problems 



More generally, we may want to minimize a function f(x) subject to a set of constraints
c(x) = 0 on x. These might be scene constraints, internal consistency constraints on the
parametrization (§2.2), or constraints arising from an implicit observation model (§3.5).
Given an initial estimate x of the solution, we try to improve this by optimizing the
quadratic local model for f subject to a linear local model of the constraints c. This linearly
constrained quadratic problem has an exact solution in linear algebra. Let g, H be the
gradient and Hessian of f as before, and let the first order expansion of the constraints be
$c(x + \delta x) \approx c(x) + C\, \delta x$ where $C \equiv \frac{dc}{dx}$. Introduce a vector of Lagrange multipliers $\lambda$ for c.
We seek the $x + \delta x$ that optimizes $f + c^\top \lambda$ subject to c = 0, i.e.
$0 = \frac{d}{dx}(f + c^\top \lambda)(x + \delta x) \approx g + H\, \delta x + C^\top \lambda$ and $0 = c(x + \delta x) \approx c(x) + C\, \delta x$.
Combining these gives the Sequential Quadratic Programming (SQP) step:




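$$ \begin{pmatrix} H & C^\top \\ C & 0 \end{pmatrix} \begin{pmatrix} \delta x \\ \lambda \end{pmatrix} = -\begin{pmatrix} g \\ c \end{pmatrix}, \qquad f(x + \delta x) \approx f - \tfrac{1}{2} \begin{pmatrix} g^\top & c^\top \end{pmatrix} \begin{pmatrix} H & C^\top \\ C & 0 \end{pmatrix}^{-1} \begin{pmatrix} g \\ c \end{pmatrix} \tag{12} $$

$$ \begin{pmatrix} H & C^\top \\ C & 0 \end{pmatrix}^{-1} = \begin{pmatrix} H^{-1} - H^{-1} C^\top D^{-1} C H^{-1} & H^{-1} C^\top D^{-1} \\ D^{-1} C H^{-1} & -D^{-1} \end{pmatrix}, \qquad D \equiv C H^{-1} C^\top \tag{13} $$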



At the optimum $\delta x$ and c vanish, but $C^\top \lambda = -g$, which is generally non-zero.
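As an illustration of the SQP step, the sketch below solves the bordered system of (12) directly and compares it with the closed-form inverse (13). The function names and the random test data are assumptions made for the example; explicit inverses appear only to mirror the formula.

```python
import numpy as np

def sqp_step(H, g, C, c):
    """Solve [[H, C^T], [C, 0]] [dx; lam] = -[g; c], as in equation (12)."""
    n, m = H.shape[0], C.shape[0]
    K = np.block([[H, C.T], [C, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([g, c]))
    return sol[:n], sol[n:]

def kkt_inverse(H, C):
    """Closed-form inverse of the bordered matrix, equation (13)."""
    Hi = np.linalg.inv(H)                 # explicit inverses only for illustration
    Di = np.linalg.inv(C @ Hi @ C.T)
    return np.block([[Hi - Hi @ C.T @ Di @ C @ Hi, Hi @ C.T @ Di],
                     [Di @ C @ Hi, -Di]])

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)             # symmetric positive definite test Hessian
g = rng.normal(size=4)
C = rng.normal(size=(2, 4))
c = rng.normal(size=2)

dx, lam = sqp_step(H, g, C, c)
print(np.allclose(C @ dx, -c), np.allclose(H @ dx + C.T @ lam, -g))
```

The two printed checks are the linearized constraint and the stationarity condition from which (12) was assembled.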

An alternative constrained approach uses the linearized constraints to eliminate some
of the variables, then optimizes over the rest. Suppose that we can order the variables
to give partitions $x = (x_1\ x_2)^\top$ and $C = (C_1\ C_2)$, where $C_1$ is square and invertible.
Then using $C_1 x_1 + C_2 x_2 = C x = -c$, we can solve for $x_1$ in terms of $x_2$ and c:
$x_1 = -C_1^{-1} (C_2 x_2 + c)$. Substituting this into the quadratic cost model has the effect of
eliminating $x_1$, leaving a smaller unconstrained reduced problem $\bar H_{22}\, x_2 = -\bar g_2$, where:

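$$ \bar H_{22} \equiv H_{22} - H_{21} C_1^{-1} C_2 - C_2^\top C_1^{-\top} H_{12} + C_2^\top C_1^{-\top} H_{11} C_1^{-1} C_2 \tag{14} $$

$$ \bar g_2 \equiv g_2 - C_2^\top C_1^{-\top} g_1 - (H_{21} - C_2^\top C_1^{-\top} H_{11})\, C_1^{-1} c \tag{15} $$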

(These matrices can be evaluated efficiently using simple matrix factorization schemes 
[11]). This method is stable provided that the chosen Ci is well-conditioned. It works well 
for dense problems, but is not always suitable for sparse ones because if C is dense, the
reduced Hessian $\bar H_{22}$ becomes dense too.
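The elimination approach can be sketched as follows, assuming the first k variables are the ones eliminated and that the corresponding k×k block $C_1$ of the constraint Jacobian is invertible. The function name and test data are hypothetical; the formulas are those of (14) and (15).

```python
import numpy as np

def reduced_problem(H, g, C, c, k):
    """Eliminate the first k variables using C = (C1 C2), with C1 (k x k) invertible."""
    C1, C2 = C[:, :k], C[:, k:]
    H11, H12, H21, H22 = H[:k, :k], H[:k, k:], H[k:, :k], H[k:, k:]
    g1, g2 = g[:k], g[k:]
    C2b = np.linalg.solve(C1, C2)         # C1^{-1} C2
    cb = np.linalg.solve(C1, c)           # C1^{-1} c
    H22r = H22 - H21 @ C2b - C2b.T @ H12 + C2b.T @ H11 @ C2b     # equation (14)
    g2r = g2 - C2b.T @ g1 - (H21 - C2b.T @ H11) @ cb             # equation (15)
    x2 = np.linalg.solve(H22r, -g2r)      # reduced unconstrained problem
    x1 = -(C2b @ x2 + cb)                 # recover the eliminated variables
    return np.concatenate([x1, x2])

rng = np.random.default_rng(8)
M = rng.normal(size=(5, 5))
H = M @ M.T + 5.0 * np.eye(5)
g = rng.normal(size=5)
C = rng.normal(size=(2, 5))
c = rng.normal(size=2)

dx = reduced_problem(H, g, C, c, k=2)
print(np.allclose(C @ dx, -c))            # linearized constraints are satisfied
```

On such a small dense problem the result agrees with the Lagrange multiplier solution of (12); the choice between the two is mainly a matter of sparsity and of how well conditioned $C_1$ is.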

For least squares cost models, constraints can also be handled within the linear least 
squares framework, e.g. see [11]. 



4.5 General Implementation Issues 

Before going into details, we mention a few points of good numerical practice for large- 
scale optimization problems such as bundle adjustment: 

Exploit the problem structure : Large-scale problems are almost always highly structured 
and bundle adjustment is no exception. In professional cartography and photogrammetric 
site-modelling, bundle problems with thousands of images and many tens of thousands 
of features are regularly solved. Such problems would simply be infeasible without a 
thorough exploitation of the natural structure and sparsity of the bundle problem. We will 
have much to say about sparsity below. 

Use factorization effectively: Many of the above formulae contain matrix inverses. This is
a convenient short-hand for theoretical calculations, but numerically, matrix inversion is 
almost never used. Instead, the matrix is decomposed into its Cholesky, LU, QR, etc., 
factors and these are used directly, e.g. linear systems are solved using forwards and 
backwards substitution. This is much faster and numerically more accurate than explicit 
use of the inverse, particularly for sparse matrices such as the bundle Hessian, whose 
factors are still quite sparse, but whose inverse is always dense. Explicit inversion is 
required only occasionally, e.g. for covariance estimates, and even then only a few of 
the entries may be needed (e.g. diagonal blocks of the covariance). Factorization is the
heart of the optimization iteration, where most of the time is spent and where most can be 
done to improve efficiency (by exploiting sparsity, symmetry and other problem structure) 
and numerical stability (by pivoting and scaling). Similarly, certain matrices (subspace
projectors, Householder matrices) have (diagonal)+(low rank) forms which should not be
explicitly evaluated as they can be applied more efficiently in pieces.

Use stable local parametrizations: As discussed in §2.2, the parametrization used for 
step prediction need not coincide with the global one used to store the state estimate. It is 
more important that it should be finite, uniform and locally as nearly linear as possible. 
If the global parametrization is in some way complex, highly nonlinear, or potentially 
ill-conditioned, it is usually preferable to use a stable local parametrization based on 
perturbations of the current state for step prediction. 

Scaling and preconditioning : Another parametrization issue that has a profound and too- 
rarely recognized influence on numerical performance is variable scaling (the choice of 
‘units’ or reference scale to use for each parameter), and more generally preconditioning 
(the choice of which linear combinations of parameters to use). These represent the linear 
part of the general parametrization problem. The performance of gradient descent and most 
other linearly convergent optimization methods is critically dependent on preconditioning, 
to the extent that for large problems, they are seldom practically useful without it. 

One of the great advantages of the Newton-like methods is their theoretical independence
of such scaling issues⁹. But even for these, scaling makes itself felt indirectly in

⁹ Under a linear change of coordinates $x \to T x$ we have $g \to T^{-\top} g$ and $H \to T^{-\top} H\, T^{-1}$, so the
Newton step $\delta x = -H^{-1} g$ varies correctly as $\delta x \to T\, \delta x$, whereas the gradient one $\delta x \sim g$
Fig. 3. The network graph, parameter connection graph, Jacobian structure and Hessian structure for
a toy bundle problem with five 3D features A-E, four images 1-4 and two camera calibrations K1
(shared by images 1,2) and K2 (shared by images 3,4). Feature A is seen in images 1,2; B in 1,2,4;
C in 1,3; D in and E in 3,4.

several ways: (i) Step control strategies including convergence tests, maximum step size
limitations, and damping strategies (trust region, Levenberg-Marquardt) are usually all
based on some implicit norm $\|\delta x\|^2$, and hence change under linear transformations of x
(e.g., damping makes the step more like the non-invariant gradient descent one). (ii) Pivoting
strategies for factoring H are highly dependent on variable scaling, as they choose
‘large’ elements on which to pivot. Here, ‘large’ should mean ‘in which little numerical
cancellation has occurred’ but with uneven scaling it becomes ‘with the largest scale’. (iii)
The choice of gauge (datum, §9) may depend on variable scaling, and this can significantly
influence convergence [82, 81].

For all of these reasons, it is important to choose variable scalings that relate mean- 
ingfully to the problem structure. This involves a judicious comparison of the relative 
influence of, e.g., a unit of error on a nearby point, a unit of error on a very distant one, 
a camera rotation error, a radial distortion error, etc. For this, it is advisable to use an 
‘ideal’ Hessian or weight matrix rather than the observed one, otherwise the scaling might 
break down if the Hessian happens to become ill-conditioned or non-positive during a few 
iterations before settling down. 
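The scaling behaviour of the Newton and gradient steps under a change of units can be illustrated numerically. This is a small sketch with an arbitrary diagonal rescaling T chosen for the example; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
H = A @ A.T + np.eye(3)                   # positive definite Hessian
g = rng.normal(size=3)
T = np.diag([1.0, 10.0, 100.0])           # hypothetical change of units x -> T x

Tinv = np.linalg.inv(T)
g_new = Tinv.T @ g                        # g -> T^{-T} g
H_new = Tinv.T @ H @ Tinv                 # H -> T^{-T} H T^{-1}

newton_old = -np.linalg.solve(H, g)
newton_new = -np.linalg.solve(H_new, g_new)
grad_old, grad_new = -g, -g_new

newton_is_covariant = np.allclose(newton_new, T @ newton_old)
gradient_is_covariant = np.allclose(grad_new, T @ grad_old)
print(newton_is_covariant, gradient_is_covariant)
```

The Newton step transforms like the coordinates themselves, whereas the gradient step does not unless $T^\top T$ is the identity, which is why first order methods depend so strongly on the choice of units.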



5 Network Structure 

Adjustment networks have a rich structure, illustrated in figure 3 for a toy bundle problem. 
The free parameters subdivide naturally into blocks corresponding to: 3D feature coordinates
A, ..., E; camera poses and unshared (single image) calibration parameters 1,
..., 4; and calibration parameters shared across several images K1, K2. Parameter blocks



varies incorrectly as $\delta x \to T^{-\top} \delta x$. The Newton and gradient descent steps agree only when
$T^\top T = H$.
interact only via their joint influence on image features and other observations, i.e. via their
joint appearance in cost function contributions. The abstract structure of the measurement 
network can be characterized graphically by the network graph (top left), which shows 
which features are seen in which images, and the parameter connection graph (top right) 
which details the sparse structure by showing which parameter blocks have direct interac- 
tions. Blocks are linked if and only if they jointly influence at least one observation. The 
cost function Jacobian (bottom left) and Hessian (bottom right) reflect this sparse structure. 
The shaded boxes correspond to non-zero blocks of matrix entries. Each block of rows in 
the Jacobian corresponds to an observed image feature and contains contributions from 
each of the parameter blocks that influenced this observation. The Hessian contains an 
off-diagonal block for each edge of the parameter connection graph, i.e. for each pair of
parameters that couple to at least one common feature or appear in at least one common
cost contribution¹⁰.

Two layers of structure are visible in the Hessian. The primary structure consists of 
the subdivision into structure (A-E) and camera (1-4, K1-K2) submatrices. Note that the
structure submatrix is block diagonal: 3D features couple only to cameras, not to other 
features. (This would no longer hold if inter-feature measurements such as distances or 
angles between points were present). The camera submatrix is often also block diagonal, 
but in this example the sharing of unknown calibration parameters produces off-diagonal 
blocks. The secondary structure is the internal sparsity pattern of the structure-camera 
Hessian submatrix. This is dense for small problems where all features are seen in all 
images, but in larger problems it often becomes quite sparse because each image only sees 
a fraction of the features. 
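The block sparsity pattern of the Hessian follows mechanically from the visibility lists. The sketch below uses the lists given in the Fig. 3 caption; the image list for feature D is not recoverable from the caption as printed here, so the value used for it is an assumption, as are all function and variable names.

```python
# Visibility from the Fig. 3 caption. The entry for D is NOT recoverable from
# the caption text, so [2, 3, 4] is an assumption made for illustration only.
seen_in = {"A": [1, 2], "B": [1, 2, 4], "C": [1, 3], "D": [2, 3, 4], "E": [3, 4]}
calib_of = {1: "K1", 2: "K1", 3: "K2", 4: "K2"}

def hessian_blocks(seen_in, calib_of):
    """Return the set of coupled parameter-block pairs (off-diagonal Hessian blocks)."""
    edges = set()
    for feature, images in seen_in.items():
        for i in images:
            # One observation couples the feature, the image pose and its calibration.
            blocks = [feature, str(i), calib_of[i]]
            for a in blocks:
                for b in blocks:
                    if a < b:
                        edges.add((a, b))
    return edges

edges = hessian_blocks(seen_in, calib_of)
features = set(seen_in)
feature_feature = [e for e in edges if e[0] in features and e[1] in features]
print(sorted(edges))
print(len(feature_feature))   # count of feature-feature blocks
```

No pair of features is ever linked, which is the block diagonality of the structure submatrix, while the shared calibrations K1 and K2 introduce the off-diagonal blocks of the camera submatrix.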

All worthwhile bundle methods exploit at least the primary structure of the Hessian, 
and advanced methods exploit the secondary structure as well. The secondary structure is 
particularly sparse and regular in surface coverage problems such as grids of photographs in
aerial cartography. Such problems can be handled using a fixed ‘nested dissection’ variable 
reordering (§6.3). But for the more irregular connectivities of close range problems, general 
sparse factorization methods may be required to handle secondary structure. 

Bundle problems are by no means limited to the above structures. For example, for
more complex scene models with moving or articulated objects, there will be additional 
connections to object pose or joint angle nodes, with linkages reflecting the kinematic 
chain structure of the scene. It is often also necessary to add constraints to the adjustment, 
e.g. coplanarity of certain points. One of the greatest advantages of the bundle technique is 
its ability to adapt to almost arbitrarily complex scene, observation and constraint models. 



¹⁰ The Jacobian structure can be described more directly by a bipartite graph whose nodes correspond
on one side to the observations, and on the other to the parameter blocks that influence them. The 
parameter connection graph is then obtained by deleting each observation node and linking each 
pair of parameter nodes that it connects to. This is an example of elimination graph processing 
(see below). 
6 Implementation Strategy 1: Second Order Adjustment Methods

The next three sections cover implementation strategies for optimizing the bundle adjustment
cost function f(x) over the complete set of unknown structure and camera parameters
x. This section is devoted to second-order Newton-style approaches, which are the basis
of the great majority of current implementations. Their most notable characteristics are
rapid (second order) asymptotic convergence but relatively high cost per iteration, with
an emphasis on exploiting the network structure (the sparsity of the Hessian $H = \frac{d^2 f}{dx^2}$)
for efficiency. In fact, the optimization aspects are more or less standard (§4, [29, 93, 42]),
so we will concentrate entirely on efficient methods for solving the linearized Newton
step prediction equations $\delta x = -H^{-1} g$ (6). For now, we will assume that the Hessian
H is non-singular. This will be amended in §9 on gauge freedom, without changing the
conclusions reached here.



6.1 The Schur Complement and the Reduced Bundle System 



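Schur complement: Consider the following block triangular matrix factorization:

$$ M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ C A^{-1} & 1 \end{pmatrix} \begin{pmatrix} A & 0 \\ 0 & \bar D \end{pmatrix} \begin{pmatrix} 1 & A^{-1} B \\ 0 & 1 \end{pmatrix}, \qquad \bar D \equiv D - C A^{-1} B \tag{16} $$

$$ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} 1 & -A^{-1} B \\ 0 & 1 \end{pmatrix} \begin{pmatrix} A^{-1} & 0 \\ 0 & \bar D^{-1} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -C A^{-1} & 1 \end{pmatrix} = \begin{pmatrix} A^{-1} + A^{-1} B \bar D^{-1} C A^{-1} & -A^{-1} B \bar D^{-1} \\ -\bar D^{-1} C A^{-1} & \bar D^{-1} \end{pmatrix} \tag{17} $$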



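Here A must be square and invertible, and for (17), the whole matrix must also be square
and invertible. $\bar D$ is called the Schur complement of A in M. If both A and D are invertible,
complementing on D rather than A gives

$$ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} \bar A^{-1} & -\bar A^{-1} B D^{-1} \\ -D^{-1} C \bar A^{-1} & D^{-1} + D^{-1} C \bar A^{-1} B D^{-1} \end{pmatrix}, \qquad \bar A = A - B D^{-1} C $$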



Equating upper left blocks gives the Woodbury formula:

$$ (A \pm B D^{-1} C)^{-1} = A^{-1} \mp A^{-1} B\, (D \pm C A^{-1} B)^{-1}\, C A^{-1} \tag{18} $$

This is the usual method of updating the inverse of a nonsingular matrix A after an update
(especially a low rank one) $A \to A \pm B D^{-1} C$. (See §8.1).
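A quick numerical check of the Woodbury formula (18), taking the upper sign, may help fix the roles of the four matrices. The diagonal base matrix and the rank-two update are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # easy-to-invert base matrix
B = rng.normal(size=(n, k))
C = B.T
D = np.eye(k)

A_inv = np.diag(1.0 / np.diag(A))
# (A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}
small = D + C @ A_inv @ B                    # only a k x k system has to be solved
updated_inv = A_inv - A_inv @ B @ np.linalg.solve(small, C @ A_inv)
direct_inv = np.linalg.inv(A + B @ np.linalg.solve(D, C))
print(np.allclose(updated_inv, direct_inv))
```

The saving comes from replacing an n×n inversion by a k×k solve, which is worthwhile when the update has low rank.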

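Reduction: Now consider the linear system $\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$. Pre-multiplying by
$\begin{pmatrix} 1 & 0 \\ -C A^{-1} & 1 \end{pmatrix}$ gives $\begin{pmatrix} A & B \\ 0 & \bar D \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ \bar b_2 \end{pmatrix}$ where $\bar b_2 \equiv b_2 - C A^{-1} b_1$. Hence we can use
Schur complement and forward substitution to find a reduced system $\bar D x_2 = \bar b_2$, solve
this for $x_2$, then back-substitute and solve to find $x_1$:

$$ \underbrace{\bar D \equiv D - C A^{-1} B, \quad \bar b_2 \equiv b_2 - C A^{-1} b_1}_{\text{Schur complement + forward substitution}} \qquad \underbrace{\bar D\, x_2 = \bar b_2}_{\text{reduced system}} \qquad \underbrace{A\, x_1 = b_1 - B\, x_2}_{\text{back-substitution}} \tag{19} $$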







Note that the reduced system entirely subsumes the contribution of the $x_1$ rows and columns
to the network. Once we have reduced, we can pretend that the problem does not involve
$x_1$ at all: it can be found later by back-substitution if needed, or ignored if not. This is
the basis of all recursive filtering methods. In bundle adjustment, if we use the primary
subdivision into feature and camera variables and subsume the structure ones, we get the
reduced camera system $\bar H_{CC}\, x_C = \bar g_C$, where:

$$ \bar H_{CC} \equiv H_{CC} - H_{CS} H_{SS}^{-1} H_{SC} = H_{CC} - \sum_p H_{Cp} H_{pp}^{-1} H_{pC} $$

$$ \bar g_C \equiv g_C - H_{CS} H_{SS}^{-1} g_S = g_C - \sum_p H_{Cp} H_{pp}^{-1} g_p $$

Here, ‘S’ selects the structure block and ‘C’ the camera one. $H_{SS}$ is block diagonal,
so the reduction can be calculated rapidly by a sum of contributions from the individual
3D features ‘p’ in S. Brown’s original 1958 method for bundle adjustment [16, 19, 100]
was based on finding the reduced camera system as above, and solving it using Gaussian
elimination. Profile Cholesky decomposition (§B.3) offers a more streamlined method of
achieving this.
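The per-feature accumulation of the reduced camera system can be sketched as below. The function name, the list-of-blocks layout and the synthetic data are assumptions made for the example, and the system is written in the generic form H x = g used for the linear system above.

```python
import numpy as np

def reduced_camera_system(H_pp, H_Cp, g_p, H_CC, g_C):
    """Accumulate the Schur complement feature by feature.

    H_pp: list of (3 x 3) diagonal blocks of H_SS, one per 3D feature p
    H_Cp: list of (nc x 3) camera-feature coupling blocks
    g_p:  list of length-3 feature right-hand sides
    """
    H_red, g_red = H_CC.copy(), g_C.copy()
    for Hpp, HCp, gp in zip(H_pp, H_Cp, g_p):
        # Solve against the small block once for both H_pC and g_p.
        Y = np.linalg.solve(Hpp, np.column_stack([HCp.T, gp]))
        H_red -= HCp @ Y[:, :-1]
        g_red -= HCp @ Y[:, -1]
    return H_red, g_red

rng = np.random.default_rng(4)
P, nc, m = 5, 4, 6                        # features, camera parameters, rows per feature
H_pp, H_Cp, g_p = [], [], []
H_CC, g_C = np.eye(nc), rng.normal(size=nc)
for _ in range(P):
    Js, Jc = rng.normal(size=(m, 3)), rng.normal(size=(m, nc))
    H_pp.append(Js.T @ Js)
    H_Cp.append(Jc.T @ Js)
    g_p.append(rng.normal(size=3))
    H_CC = H_CC + Jc.T @ Jc

H_red, g_red = reduced_camera_system(H_pp, H_Cp, g_p, H_CC, g_C)
x_C = np.linalg.solve(H_red, g_red)
print(x_C)
```

Only 3×3 systems are solved inside the loop, so the cost of forming the reduced system grows linearly with the number of features; the camera unknowns agree with those of the full system.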

Occasionally, long image sequences have more camera parameters than structure ones. 
In this case it is more efficient to reduce the camera parameters, leaving a reduced structure 
system. 



6.2 Triangular Decompositions 

If D in (16) is further subdivided into blocks, the factorization process can be contin- 
ued recursively. In fact, there is a family of block (lower triangular)*(diagonal)*(upper 
triangular) factorizations A = L D U : 




See §B.1 for computational details. The main advantage of triangular factorizations is that
they make linear algebra computations with the matrix much easier. In particular, if the
input matrix A is square and nonsingular, linear equations A x = b can be solved by a
sequence of three recursions that implicitly implement multiplication by $A^{-1} = U^{-1} D^{-1} L^{-1}$:



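$$ L\,c = b: \quad c_i \leftarrow L_{ii}^{-1} \Big( b_i - \sum_{j<i} L_{ij}\, c_j \Big) \qquad \text{forward substitution} \tag{22} $$

$$ D\,d = c: \quad d_i \leftarrow D_i^{-1}\, c_i \qquad \text{diagonal solution} \tag{23} $$

$$ U\,x = d: \quad x_i \leftarrow U_{ii}^{-1} \Big( d_i - \sum_{j>i} U_{ij}\, x_j \Big) \qquad \text{back-substitution} \tag{24} $$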



Forward substitution corrects for the influence of earlier variables on later ones, diagonal
solution solves the transformed system, and back-substitution propagates corrections due
to later variables back to earlier ones. In practice, this is the usual method of solving linear
equations such as the Newton step prediction equations. It is stabler and much faster than
explicitly inverting A and multiplying by $A^{-1}$.
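The three recursions (22)-(24) can be written out directly for scalar (1×1) blocks. In this sketch the factors are obtained by rescaling a Cholesky factor into $L D L^\top$ form, which is just a convenient way of producing valid test input; the function name is hypothetical.

```python
import numpy as np

def ldu_solve(L, D, U, b):
    """Solve (L D U) x = b with L lower, D diagonal (1-D array), U upper triangular."""
    n = len(b)
    c = np.zeros(n)
    for i in range(n):                       # forward substitution (22)
        c[i] = (b[i] - L[i, :i] @ c[:i]) / L[i, i]
    d = c / D                                # diagonal solution (23)
    x = np.zeros(n)
    for i in reversed(range(n)):             # back-substitution (24)
        x[i] = (d[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(5)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5.0 * np.eye(5)                # symmetric positive definite
G = np.linalg.cholesky(A)                    # A = G G^T
L = G / np.diag(G)                           # unit lower triangular
D = np.diag(G) ** 2
U = L.T                                      # so that A = L D L^T
b = rng.normal(size=5)

x = ldu_solve(L, D, U, b)
print(np.allclose(A @ x, b))
```

The inverse of A is never formed: each recursion touches every stored factor entry once, which is what keeps the method cheap for sparse factors.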
The diagonal blocks $L_{ii}$, $D_i$, $U_{ii}$ can be set arbitrarily provided that the product $L_{ii} D_i U_{ii}$
remains constant. This gives a number of well-known factorizations, each optimized for a
different class of matrices. Pivoting (row and/or column exchanges designed to improve
the conditioning of L and/or U, §B.1) is also necessary in most cases, to ensure stability.
Choosing $L_{ii} = D_i = 1$ gives the (block) LU decomposition A = L U, the matrix representation
of (block) Gaussian elimination. Pivoted by rows, this is the standard method for
non-symmetric matrices. For symmetric A, roughly half of the work of factorization can
be saved by using a symmetry-preserving $LDL^\top$ factorization, for which D is symmetric
and $U = L^\top$. The pivoting strategy must also preserve symmetry in this case, so it has to
permute columns in the same way as the corresponding rows. If A is symmetric positive
definite we can further set D = 1 to get the Cholesky decomposition $A = L L^\top$. This is
stable even without pivoting, and hence extremely simple to implement. It is the standard
decomposition method for almost all unconstrained optimization problems including bundle
adjustment, as the Hessian is positive definite near a non-degenerate cost minimum
(and in the Gauss-Newton approximation, almost everywhere else, too). If A is symmetric
but only positive semidefinite, diagonally pivoted Cholesky decomposition can be used.
This is the case, e.g. in subset selection methods of gauge fixing (§9.5). Finally, if A is
symmetric but indefinite, it is not possible to reduce D stably to 1. Instead, the Bunch-Kaufman
method is used. This is a diagonally pivoted $LDL^\top$ method, where D has a
mixture of 1×1 and 2×2 diagonal blocks. The augmented Hessian $\begin{pmatrix} H & C^\top \\ C & 0 \end{pmatrix}$ of the Lagrange
multiplier method for constrained optimization problems (12) is always symmetric
indefinite, so Bunch-Kaufman is the recommended method for solving constrained bundle
problems. (It is something like 40% faster than Gaussian elimination, and about equally
stable).

Another use of factorization is matrix inversion. Inverses can be calculated by factoring,
inverting each triangular factor by forwards or backwards substitution (52), and multiplying
out: $A^{-1} = U^{-1} D^{-1} L^{-1}$. However, explicit inverses are rarely used in numerical analysis,
it being both stabler and much faster in almost all cases to leave them implicit and work
by forward/backward substitution w.r.t. a factorization, rather than multiplication by the
inverse. One place where inversion is needed in its own right is to calculate the dispersion
matrix (inverse Hessian, which asymptotically gives the posterior covariance) as a measure
of the likely variability of parameter estimates. The dispersion can be calculated by explicit
inversion of the factored Hessian, but often only a few of its entries are needed, e.g. the
diagonal blocks and a few key off-diagonal parameter covariances. In this case (53) can be
used, which efficiently calculates the covariance entries corresponding to just the nonzero
elements of L, D, U.

6.3 Sparse Factorization 

To apply the above decompositions to sparse matrices, we must obviously avoid storing
and manipulating the zero blocks. But there is more to the subject than this. As a sparse
matrix is decomposed, zero positions tend to rapidly fill in (become non-zero), essentially
because decomposition is based on repeated linear combination of matrix rows, which
is generically non-zero wherever any one of its inputs is. Fill-in depends strongly on the
order in which variables are eliminated, so efficient sparse factorization routines attempt
to minimize either operation counts or fill-in by re-ordering the variables. (The Schur
process is fixed in advance, so this is the only available freedom). Globally minimizing
either operations or fill-in is NP complete, but reasonably good heuristics exist (see below).
Variable order affects stability (pivoting) as well as speed, and these two goals conflict to
some extent. Finding heuristics that work well on both counts is still a research problem.

Algorithmically, fill-in is characterized by an elimination graph derived from the parameter
coupling / Hessian graph [40, 26, 11]. To create this, nodes (blocks of parameters)
are visited in the given elimination ordering, at each step linking together all unvisited
nodes that are currently linked to the current node. The coupling of block i to block j via
visited block k corresponds to a non-zero Schur contribution $L_{ik} D_k U_{kj}$, and at each stage
the subgraph on the currently unvisited nodes is the coupling graph of the current reduced
Hessian. The amount of fill-in is the number of new graph edges created in this process.
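The elimination graph process can be simulated in a few lines. This is a minimal sketch that assumes a symmetric adjacency given as a dictionary of neighbour sets; the function name and the small star-shaped example are illustrative.

```python
from itertools import combinations

def fill_in(adjacency, order):
    """Count new edges created when nodes are eliminated in the given order."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    created = 0
    for v in order:
        nbrs = adj.pop(v)
        # Link together all unvisited nodes currently linked to v.
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                created += 1
        for u in nbrs:
            adj[u].discard(v)
    return created

# Arrowhead (star) coupling: hub "c" is linked to four otherwise independent nodes.
star = {"c": {"p1", "p2", "p3", "p4"},
        "p1": {"c"}, "p2": {"c"}, "p3": {"c"}, "p4": {"c"}}
hub_last = fill_in(star, ["p1", "p2", "p3", "p4", "c"])
hub_first = fill_in(star, ["c", "p1", "p2", "p3", "p4"])
print(hub_last, hub_first)
```

Eliminating the hub last creates no new edges, whereas eliminating it first links all four leaves pairwise (six new edges). This is the same effect that makes the structure-first ordering of the bundle Hessian attractive.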



Pattern Matrices We seek variable orderings that approximately minimize the total 
operation count or fill-in over the whole elimination chain. For many problems a suitable 
ordering can be fixed in advance, typically giving one of a few standard pattern matrices 
such as band or arrowhead matrices, perhaps with such structure at several levels. 
[Illustrations of pattern matrices: bundle Hessian, arrowhead matrix, block tridiagonal matrix]

The most prominent pattern structure in bundle adjustment is the primary subdivision of 
the Hessian into structure and camera blocks. To get the reduced camera system (19), 
we treat the Hessian as an arrowhead matrix with a broad final column containing all of 
the camera parameters. Arrowhead matrices are trivial to factor or reduce by block 2×2
Schur complementation, cf. (16, 19). For bundle problems with many independent images
and only a few features, one can also complement on the image parameter block to get a 
reduced structure system. 

Another very common pattern structure is the block tridiagonal one which characterizes 
all singly coupled chains (sequences of images with only pairwise overlap, Kalman filtering 
and other time recursions, simple kinematic chains). Tridiagonal matrices are factored or 
reduced by recursive block 2×2 Schur complementation starting from one end. The L
and U factors are also block tridiagonal, but the inverse is generally dense. 

Pattern orderings are often very natural but it is unwise to think of them as immutable: 
structure often occurs at several levels and deeper structure or simply changes in the relative 
sizes of the various parameter classes may make alternative orderings preferable. For more 
difficult problems there are two basic classes of on-line ordering strategies. Bottom-up 
methods try to minimize fill-in locally and greedily at each step, at the risk of global short- 
sightedness. Top-down methods take a divide-and-conquer approach, recursively splitting 
the problem into smaller sub-problems which are solved quasi-independently and later 
merged. 



Top-Down Ordering Methods The most common top-down method is called nested dissection
or recursive partitioning [64, 57, 19, 38, 40, 11]. The basic idea is to recursively
split the factorization problem into smaller sub-problems, solve these independently, and



Fig. 4. A bundle Hessian for an irregular coverage problem with only local connections, and its 
Cholesky factor in natural (structure-then-camera), minimum degree, and reverse Cuthill-McKee 
ordering. 



then glue the solutions together along their common boundaries. Splitting involves choos- 
ing a separating set of variables, whose deletion will separate the remaining variables into 
two or more independent subsets. This corresponds to finding a (vertex) graph cut of the 
elimination graph, i.e. a set of vertices whose deletion will split it into two or more discon- 
nected components. Given such a partitioning, the variables are reordered into connected 
components, with the separating set ones last. This produces an ‘arrowhead’ matrix, e.g. : 
(26)



The arrowhead matrix is factored by blocks, as in reduction or profile Cholesky, tak- 
ing account of any internal sparsity in the diagonal blocks and the borders. Any suitable 
factorization method can be used for the diagonal blocks, including further recursive par- 
titionings. 










Nested dissection is most useful when comparatively small separating sets can be found. 
A trivial example is the primary structure of the bundle problem: the camera variables 
separate the 3D structure into independent features, giving the standard arrowhead form of 
the bundle Hessian. More interestingly, networks with good geometric or temporal locality 
(surface- and site-covering networks, video sequences) tend to have small separating sets 
based on spatial or temporal subdivision. The classic examples are geodesic and aerial 
cartography networks with their local 2D connections — spatial bisection gives simple 
and very efficient recursive decompositions for these [64, 57, 19]. 

For sparse problems with less regular structure, one can use graph partitioning algo- 
rithms to find small separating sets. Finding a globally minimal partition sequence is NP 
complete but several effective heuristics exist. This is currently an active research field. 
One promising family are multilevel schemes [70, 71, 65, 4] which decimate (subsample) 
the graph, partition using e.g. a spectral method, then refine the result to the original graph. 
(These algorithms should also be very well-suited to graph based visual segmentation and 
matching). 



Bottom-Up Ordering Methods Many bottom-up variable ordering heuristics exist. Probably
the most widespread and effective is minimum degree ordering. At each step, this
eliminates the variable coupled to the fewest remaining ones (i.e. the elimination graph
node with the fewest unvisited neighbours), so it minimizes the number of
changed matrix elements and hence FLOPs for the step. The minimum degree ordering
can also be computed quite rapidly without explicit graph chasing. A related ordering,
minimum deficiency, minimizes the fill-in (newly created edges) at each step, but this is
considerably slower to calculate and not usually so effective.
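A naive minimum degree ordering can be sketched with explicit graph updates. Production codes avoid this explicit graph chasing, so the following is only meant to show the greedy rule; the function name, the tie-breaking by node name and the example graph are assumptions.

```python
def minimum_degree_order(adjacency):
    """Greedy ordering: repeatedly eliminate a node with the fewest unvisited neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))   # ties broken by name
        nbrs = adj.pop(v)
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}       # eliminating v links its neighbours together
        order.append(v)
    return order

# Hub "c" coupled to four leaves: the leaves have the lowest degree and go first.
star = {"c": {"p1", "p2", "p3", "p4"},
        "p1": {"c"}, "p2": {"c"}, "p3": {"c"}, "p4": {"c"}}
order = minimum_degree_order(star)
print(order)
```

On this example the hub is postponed until its degree has dropped to that of the last remaining leaf, so no fill-in is created.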

Fill-in or operation minimizing strategies tend to produce somewhat fragmentary ma- 
trices that require pointer- or index-based sparse matrix implementations (see fig. 4). This 
increases complexity and tends to reduce cache locality and pipeline-ability. An alternative 
is to use profile matrices which (for lower triangles) store all elements in each row between 
the first non-zero one and the diagonal in a contiguous block. This is easy to implement 
(see §B.3), and practically efficient so long as about 30% or more of the stored elements are 
actually non-zero. Orderings for this case aim to minimize the sum of the profile lengths 
rather than the number of non-zero elements. Profiling enforces a multiply-linked chain 
structure on the variables, so it is especially successful for linear / chain-like / one dimen- 
sional problems, e.g. space or time sequences. The simplest profiling strategy is reverse 
Cuthill-McKee which chooses some initial variable (very preferably one from one ‘end’ 
of the chain), adds all variables coupled to that, then all variables coupled to those, etc., 
then reverses the ordering (otherwise, any highly-coupled variables get eliminated early 
on, which causes disastrous fill-in). More sophisticated are the so-called banker’s strate- 
gies, which maintain an active set of all the variables coupled to the already-eliminated 
ones, and choose the next variable — from the active set (King [72]), it and its neighbours 
(Snay [101]) or all uneliminated variables (Levy [75]) — to minimize the new size of the 
active set at each step. In particular, Snay’s banker’s algorithm is reported to perform 
well on geodesy and aerial cartography problems [101, 24]. 
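Reverse Cuthill-McKee as described above amounts to a breadth-first sweep followed by a reversal. The sketch below assumes a connected graph given as a dictionary of neighbour sets; visiting neighbours in order of increasing degree is the customary refinement rather than something stated in the text, and the helper names are hypothetical.

```python
from collections import deque

def reverse_cuthill_mckee(adjacency, start):
    """Breadth-first ordering from `start`, neighbours by increasing degree, then reversed."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in sorted(adjacency[v] - seen, key=lambda w: (len(adjacency[w]), w)):
            seen.add(u)
            queue.append(u)
    return order[::-1]

def bandwidth(adjacency, order):
    """Largest distance from the diagonal of any coupled pair under the ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u in adjacency for v in adjacency[u])

# A chain of six variables whose labels are scrambled relative to the chain order.
chain = {0: {3}, 3: {0, 5}, 5: {3, 1}, 1: {5, 4}, 4: {1, 2}, 2: {4}}
order = reverse_cuthill_mckee(chain, start=0)
print(order, bandwidth(chain, order), bandwidth(chain, sorted(chain)))
```

Starting from one end of the chain recovers a tridiagonal profile, whereas the scrambled labelling spreads the couplings far from the diagonal.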

For all of these automatic ordering methods, it often pays to do some of the initial work 
by hand, e.g. it might be appropriate to enforce the structure / camera division beforehand 
and only order the reduced camera system. If there are nodes of particularly high degree 
such as inner gauge constraints, the ordering calculation will usually run faster and the 
quality may also be improved by removing these from the graph and placing them last by
hand. 

The above ordering methods apply to both Cholesky / $LDL^\top$ decomposition of the
Hessian and QR decomposition of the least squares Jacobian. Sparse QR methods can be 
implemented either with Givens rotations or (more efficiently) with sparse Householder 
transformations. Row ordering is important for the Givens methods [39]. For Householder 
ones (and some Givens ones too) the multifrontal organization is now usual [41, 11], as 
it captures the natural parallelism of the problem. 



7 Implementation Strategy 2: First Order Adjustment Methods 

We have seen that for large problems, factoring the Hessian H to compute the Newton 
step can be both expensive and (if done efficiently) rather complex. In this section we 
consider alternative methods that avoid the cost of exact factorization. As the Newton step 
can not be calculated, such methods generally only achieve first order (linear) asymptotic 
convergence : when close to the final state estimate, the error is asymptotically reduced by a 
constant (and in practice often depressingly small) factor at each step, whereas quadratically 
convergent Newton methods roughly double the number of significant digits at each step. 
So first order methods require more iterations than second order ones, but each iteration 
is usually much cheaper. The relative efficiency depends on the relative sizes of these 
two effects, both of which can be substantial. For large problems, the reduction in work
per iteration is usually at least $O(n)$, where $n$ is the problem size. But whereas Newton
methods converge from $O(1)$ to $O(10^{-16})$ in about $1 + \log_2 16 = 5$ iterations, linearly
convergent ones take respectively $\log 10^{-16} / \log(1 - \gamma) = 16, 350, 3700$ iterations for
reduction $\gamma = 0.9, 0.1, 0.01$ per iteration. Unfortunately, reductions of only 1% or less are
by no means unusual in practice (§7.2), and the reduction tends to decrease as $n$ increases.
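
The iteration counts quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names and the target accuracy argument are our own choices, not notation from the text.

```python
import math

def linear_iterations(gamma, target=1e-16):
    """Steps needed when each step multiplies the error by (1 - gamma),
    i.e. the smallest k with (1 - gamma)**k <= target (not rounded)."""
    return math.log(target) / math.log(1.0 - gamma)

def newton_iterations(digits=16):
    """Rough count for a quadratically convergent method, mirroring the
    text's estimate 1 + log2(digits)."""
    return 1 + math.log2(digits)

for gamma in (0.9, 0.1, 0.01):
    print(gamma, round(linear_iterations(gamma)))
print("Newton:", newton_iterations())
```

With a 1% reduction per step the linear method needs several thousand iterations, which is why the per-iteration saving has to be very large for first order methods to pay off.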

7.1 First Order Iterations 

We first consider a number of common first order methods, before returning to the question 
of why they are often slow. 

Gradient descent: The simplest first order method is gradient descent, which “slides
down the gradient” by taking $\delta x \sim -g$ or $H_a = 1$. Line search is needed, to find an
appropriate scale for the step. For most problems, gradient descent is spectacularly inefficient
unless the Hessian actually happens to be very close to a multiple of 1. This can be arranged
by preconditioning with a linear transform $L$: $x \to L x$, $g \to L^{-\top} g$ and
$H \to L^{-\top} H L^{-1}$, where $L^\top L \approx H$, i.e. $L^\top$ is an approximate Cholesky factor
(or other left square root) of $H$, so that $H \to L^{-\top} H L^{-1} \approx 1$. In this very special
case, preconditioned gradient descent approximates the Newton method. Strictly speaking, gradient descent is a cheat: the gradient is a
covector (linear form on vectors) not a vector, so it does not actually define a direction in 
the search space. Gradient descent’s sensitivity to the coordinate system is one symptom 
of this. 
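
As a rough illustration of this coordinate sensitivity, the sketch below runs steepest descent with an exact line search on a small ill-conditioned quadratic, before and after the change of variables $x \to L x$. The test matrix, the starting point and the function name are assumptions chosen for the example, and the preconditioner here is the exact Cholesky factor, which is the idealized case.

```python
import numpy as np

def gradient_descent(H, x0, tol=1e-10, max_iter=100000):
    """Steepest descent with exact line search on f(x) = 0.5 x^T H x.
    Returns the final x and the number of steps taken."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(max_iter):
        g = H @ x                        # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = (g @ g) / (g @ H @ g)    # exact minimizer along -g
        x = x - alpha * g
    return x, max_iter

H = np.diag([1.0, 100.0])                # condition number 100
x0 = np.array([100.0, 1.0])              # a bad starting point for this H
_, plain_steps = gradient_descent(H, x0)

L = np.linalg.cholesky(H).T              # upper triangular, L^T L = H
L_inv = np.linalg.inv(L)
H_pre = L_inv.T @ H @ L_inv              # transformed Hessian L^{-T} H L^{-1}
_, pre_steps = gradient_descent(H_pre, L @ x0)

print(plain_steps, pre_steps)
```

In the transformed coordinates the Hessian is the identity, so a single step reaches the minimum, whereas the unpreconditioned iteration zigzags for a long time.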

Alternation : Another simple approach is alternation : partition the variables into groups 
and cycle through the groups optimizing over each in turn, with the other groups held 
fixed. This is most appropriate when the subproblems are significantly easier to optimize
than the full one. A natural and often-rediscovered alternation for the bundle problem is
resection-intersection, which interleaves steps of resection (finding the camera poses and
if necessary calibrations from fixed 3D features) and intersection (finding the 3D features
from fixed camera poses and calibrations). The subproblems for individual features and 
cameras are independent, so only the diagonal blocks of H are required. 

Alternation can be used in several ways. One extreme is to optimize (or perhaps only 
perform one step of optimization) over each group in turn, with a state update and re- 
evaluation of (the relevant components of) g, H after each group. Alternatively, some of 
the re-evaluations can be simulated by evaluating the linearized effects of the parameter 
group update on the other groups. E.g., for resection-intersection with structure update
$\delta x_S = -H_{SS}^{-1}\, g_S(x_S, x_C)$ (where ‘S’ selects the structure variables and ‘C’ the camera
ones), the updated camera gradient is exactly the gradient of the reduced camera system,
$g_C(x_S + \delta x_S, x_C) \approx g_C(x_S, x_C) + H_{CS}\, \delta x_S = g_C - H_{CS} H_{SS}^{-1} g_S$. So the total update
for the cycle is

$$\begin{pmatrix} \delta x_S \\ \delta x_C \end{pmatrix} = -\begin{pmatrix} H_{SS}^{-1} & 0 \\ -H_{CC}^{-1} H_{CS} H_{SS}^{-1} & H_{CC}^{-1} \end{pmatrix} \begin{pmatrix} g_S \\ g_C \end{pmatrix} = -\begin{pmatrix} H_{SS} & 0 \\ H_{CS} & H_{CC} \end{pmatrix}^{-1} \begin{pmatrix} g_S \\ g_C \end{pmatrix}.$$

In general, this correction propagation amounts to solving the system as if the above-diagonal
triangle of $H$ were zero. Once we have cycled through the variables, we can update the
full state and relinearize. This is the nonlinear Gauss-Seidel method. Alternatively, we
can split the above-diagonal triangle of $H$ off as a correction (back-propagation) term
and continue iterating

$$\begin{pmatrix} H_{SS} & 0 \\ H_{CS} & H_{CC} \end{pmatrix} \begin{pmatrix} \delta x_S \\ \delta x_C \end{pmatrix}^{(k)} = -\begin{pmatrix} g_S \\ g_C \end{pmatrix} - \begin{pmatrix} 0 & H_{SC} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \delta x_S \\ \delta x_C \end{pmatrix}^{(k-1)}$$

until (hopefully) $(\delta x_S, \delta x_C)$ converges to the full Newton step $\delta x = -H^{-1} g$. This is the linear
Gauss-Seidel method applied to solving the Newton step prediction equations. Finally,
alternation methods always tend to underestimate the size of the Newton step because
they fail to account for the cost-reducing effects of including the back-substitution terms.
Successive Over-Relaxation (SOR) methods improve the convergence rate by artificially
lengthening the update steps by a heuristic factor $1 < \gamma < 2$.
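
The linear Gauss-Seidel and SOR iterations can be sketched for a two-block system as follows. This is a minimal dense illustration, not an efficient bundle implementation: the function name, the random test Hessian and the relaxation value 1.2 are assumptions, and the relaxation factor is called `omega` in the code.

```python
import numpy as np

def block_gauss_seidel(H, g, n_s, omega=1.0, sweeps=100):
    """Approximate the Newton step dx = -H^{-1} g by alternating over the
    structure block S = [0, n_s) and the camera block C = [n_s, n).
    omega = 1 gives linear Gauss-Seidel; 1 < omega < 2 gives SOR."""
    S, C = slice(0, n_s), slice(n_s, len(g))
    dx = np.zeros(len(g))
    for _ in range(sweeps):
        # 'Intersection': solve for structure with the camera step held fixed.
        target = np.linalg.solve(H[S, S], -g[S] - H[S, C] @ dx[C])
        dx[S] += omega * (target - dx[S])
        # 'Resection': solve for cameras using the just-updated structure step.
        target = np.linalg.solve(H[C, C], -g[C] - H[C, S] @ dx[S])
        dx[C] += omega * (target - dx[C])
    return dx

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
H = M @ M.T + 10.0 * np.eye(6)           # well-conditioned test Hessian
g = rng.standard_normal(6)

newton = np.linalg.solve(H, -g)
one_cycle = block_gauss_seidel(H, g, n_s=4, sweeps=1)
gs = block_gauss_seidel(H, g, n_s=4)
sor = block_gauss_seidel(H, g, n_s=4, omega=1.2)
print(np.linalg.norm(one_cycle - newton), np.linalg.norm(gs - newton))
```

A single sweep from zero reproduces the one-cycle update, i.e. the solution with the above-diagonal block of $H$ set to zero; repeated sweeps approach the Newton step. On a strongly coupled (poorly conditioned) Hessian many more sweeps would be needed.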

Most if not all of the above alternations have been applied to both the bundle problem 
and the independent model one many times, e.g. [19, 95, 2, 108, 91, 20]. Brown considered 
the relatively sophisticated SOR method for aerial cartography problems as early as 1964, 
before developing his recursive decomposition method [19]. None of these alternations are 
very effective for traditional large-scale problems, although §7.4 below shows that they 
can sometimes compete for smaller highly connected ones. 



Krylov subspace methods: Another large family of iterative techniques are the Krylov
subspace methods, based on the remarkable properties of the power subspaces
$\mathrm{Span}(\{A^k b \mid k = 0 \ldots n\})$ for fixed $A$, $b$ as $n$ increases. Krylov iterations predominate
in many large-scale linear algebra applications, including linear equation solving.

Fig. 5. An example of the typical behaviour of first and second order convergent methods near the
minimum. This is a 2D projection of a small but ill-conditioned bundle problem along the two
most variable directions. The second order methods converge quite rapidly, whether they use an exact
(Gauss-Newton) or an iterative (diagonally preconditioned conjugate gradient) linear solver for the
Newton equations. In contrast, first order methods such as resection-intersection converge slowly
near the minimum owing to their inaccurate model of the Hessian. The effects of mismodelling can
be reduced to some extent by adding a line search.

The earliest and greatest Krylov method is the conjugate gradient iteration for solving
a positive definite linear system or optimizing a quadratic cost function. By augmenting the
gradient descent step with a carefully chosen multiple of the previous step, this manages
to minimize the quadratic model function over the entire $k$-th Krylov subspace at the $k$-th
iteration, and hence (in exact arithmetic) over the whole space at the $n_x$-th one. This no
longer holds when there is round-off error, but $O(n_x)$ iterations usually still suffice to find
the Newton step. Each iteration is $O(n_x^2)$ so this is not in itself a large gain over explicit
factorization. However convergence is significantly faster if the eigenvalues of $H$ are tightly
clustered away from zero: if the eigenvalues are covered by intervals $[a_i, b_i]_{i=1 \ldots k}$, convergence
occurs in $O(\sum_{i=1}^{k} \sqrt{b_i / a_i})$ iterations [99, 47, 48]**. Preconditioning (see below)
aims at achieving such clustering. As with alternation methods, there is a range of possible
update / re-linearization choices, ranging from a fully nonlinear method that relinearizes
after each step, to solving the Newton equations exactly using many linear iterations. One
major advantage of conjugate gradient is its simplicity: there is no factorization, all that is
needed is multiplication by $H$. For the full nonlinear method, $H$ is not even needed: one
simply makes a line search to find the cost minimum along the direction defined by $g$ and
the previous step.

** For other eigenvalue-based analyses of the bundle adjustment covariance, see [103, 92].
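
The linear conjugate gradient iteration can be written using nothing but products with $H$. The sketch below is a textbook version applied to the Newton equations $H\,\delta x = -g$; the function name, the tolerance and the synthetic Hessian with three eigenvalue clusters are assumptions made for the illustration.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only the
    product v -> A v. Returns the solution and the iteration count."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - A x (x starts at zero)
    p = r.copy()                 # first search direction is the residual
    rs = r @ r
    steps = 0
    for steps in range(1, (max_iter or len(b)) + 1):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)    # exact minimizer along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # residual plus a multiple of the previous step
        rs = rs_new
    return x, steps

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
eigenvalues = np.repeat([1.0, 10.0, 100.0], 10)   # three tight clusters
H = Q @ np.diag(eigenvalues) @ Q.T
g = rng.standard_normal(30)

dx, steps = conjugate_gradient(lambda v: H @ v, -g)
print(steps)
```

Although the system has 30 unknowns, the iteration stops after only a handful of steps because the spectrum consists of three clusters, which is the behaviour the eigenvalue bound above describes.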

One disadvantage of nonlinear conjugate gradient is its high sensitivity to the accuracy
of the line search. Achieving the required accuracy may waste several function evaluations
at each step. One way to avoid this is to make the information obtained by the conjugation
process more explicit by building up an explicit approximation to $H$ or $H^{-1}$. Quasi-Newton
methods such as the BFGS method do this, and hence need less accurate line searches.
The quasi-Newton approximation to $H$ or $H^{-1}$ is dense and hence expensive to store and
manipulate, but Limited Memory Quasi-Newton (LMQN) methods often get much of
the desired effect by maintaining only a low-rank approximation.

There are variants of all of these methods for least squares (Jacobian rather than Hessian 
based) and for constrained problems (non-positive definite matrices). 

7.2 Why Are First Order Methods Slow? 

To understand why first order methods often have slow convergence, consider the effect of
approximating the Hessian in Newton’s method. Suppose that in some local parametrization
$x$ centred at a cost minimum $x = 0$, the cost function is well approximated by a
quadratic near 0: $f(x) \approx \frac{1}{2} x^\top H x$ and hence $g(x) \approx H x$, where $H$ is the true Hessian.
For most first order methods, the predicted step is linear in the gradient $g$. If we adopt a
Newton-like state update $\delta x = -H_a^{-1} g(x)$ based on some approximation $H_a$ to $H$, we get
an iteration:

$$x_{k+1} = x_k - H_a^{-1} g(x_k) \approx (1 - H_a^{-1} H)\, x_k \approx \ldots \approx (1 - H_a^{-1} H)^{k+1} x_0 \qquad (27)$$

The numerical behaviour is determined by projecting $x_0$ along the eigenvectors of $1 - H_a^{-1} H$.
The components corresponding to large-modulus eigenvalues decay slowly and hence
asymptotically dominate the residual error. For generic $x_0$, the method converges ‘linearly’
(i.e. exponentially) at rate $\| 1 - H_a^{-1} H \|_2$, or diverges if this is greater than one. (Of course,
the exact Newton step $\delta x = -H^{-1} g$ converges in a single iteration, as $H_a = H$). Along
eigen-directions corresponding to positive eigenvalues (for which $H_a$ overestimates $H$),
the iteration is over-damped and convergence is slow but monotonic. Conversely, along
directions corresponding to negative eigenvalues (for which $H_a$ underestimates $H$), the
iteration is under-damped and zigzags towards the solution. If $H$ is underestimated by a
factor greater than two along any direction, there is divergence. Figure 5 shows an example
of the typical asymptotic behaviour of first and second order methods in a small bundle
problem.
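
The three regimes of iteration (27) are easy to reproduce numerically. The sketch below uses a small made-up Hessian and scalar multiples of it as the approximation $H_a$; the function names are illustrative, and the rate is computed as the spectral radius of $1 - H_a^{-1} H$, which is what governs the asymptotic behaviour.

```python
import numpy as np

def convergence_rate(H, H_a):
    """Spectral radius of 1 - H_a^{-1} H: the asymptotic per-step error factor."""
    M = np.eye(len(H)) - np.linalg.solve(H_a, H)
    return np.max(np.abs(np.linalg.eigvals(M)))

def run_iteration(H, H_a, x0, steps):
    """Apply x <- x - H_a^{-1} g(x) with g(x) = H x, recording the error norms."""
    x = np.asarray(x0, dtype=float)
    errors = []
    for _ in range(steps):
        x = x - np.linalg.solve(H_a, H @ x)
        errors.append(np.linalg.norm(x))
    return errors

H = np.array([[4.0, 1.0], [1.0, 3.0]])
x0 = [1.0, 1.0]
over = run_iteration(H, 10.0 * H, x0, 5)     # H_a overestimates H: slow, monotonic
under = run_iteration(H, H / 1.5, x0, 5)     # underestimate by 1.5: zigzag, converges
diverge = run_iteration(H, H / 2.5, x0, 5)   # underestimate by more than 2: diverges

for name, H_a in (("over", 10.0 * H), ("under", H / 1.5), ("diverge", H / 2.5)):
    print(name, convergence_rate(H, H_a))
```

Overestimating $H$ tenfold leaves 90% of the error after every step, while underestimating it by a factor of 2.5 makes the error grow by half at each step.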

Ignoring the camera-feature coupling : As an example, many approximate bundle meth- 
ods ignore or approximate the off-diagonal feature-camera blocks of the Hessian. This 
amounts to ignoring the fact that the cost of a feature displacement can be partially offset 
by a compensatory camera displacement and vice versa. It therefore significantly over- 
estimates the total ‘stiffness’ of the network, particularly for large, loosely connected 
networks. The fact that off-diagonal blocks are not negligible compared to the diagonal 
ones can be seen in several ways: 

• Looking forward to §9, before the gauge is fixed, the full Hessian is singular owing to 
gauge freedom. The diagonal blocks by themselves are well-conditioned, but including 
the off-diagonal ones entirely cancels this along the gauge orbit directions. Although 
gauge fixing removes the resulting singularity, it can not change the fact that the off- 
diagonal blocks have enough weight to counteract the diagonal ones. 

• In bundle adjustment, certain well-known ambiguities (poorly-controlled parameter 
combinations) often dominate the uncertainty. Camera distance and focal length es- 
timates, and structure depth and camera baseline ones (bas-relief), are both strongly 
correlated whenever the perspective is weak and become strict ambiguities in the affine 
limit. The well-conditioned diagonal blocks of the Hessian give no hint of these ambi- 
guities : when both features and cameras are free, the overall network is much less rigid 
than it appears to be when each treats the other as fixed. 

• During bundle adjustment, local structure refinements cause ‘ripples’ that must be prop- 
agated throughout the network. The camera-feature coupling information carried in the 
off-diagonal blocks is essential to this. In the diagonal-only model, ripples can propa- 
gate at most one feature-camera-feature step per iteration, so it takes many iterations 
for them to cross and re-cross a sparsely coupled network. 

These arguments suggest that any approximation $H_a$ to the bundle Hessian $H$ that suppresses
or significantly alters the off-diagonal terms is likely to have large $\| 1 - H_a^{-1} H \|$
and hence slow convergence. This is exactly what we have observed in practice for all
such methods that we have tested: near the minimum, convergence is linear and for large
problems often extremely slow, with $\| 1 - H_a^{-1} H \|_2$ very close to 1. The iteration may
either zigzag or converge slowly and monotonically, depending on the exact method and
parameter values.
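
The first of the arguments above (well-conditioned diagonal blocks, singular full Hessian) can be seen in a deliberately tiny toy model. The sketch assumes a one-dimensional, translation-only ‘network’ in which every residual depends only on the difference between a feature coordinate and a camera coordinate; this model is our own simplification and is far simpler than a real bundle problem.

```python
import numpy as np

n_feat, n_cam = 4, 3
rows = []
for c in range(n_cam):
    for f in range(n_feat):
        row = np.zeros(n_feat + n_cam)
        row[f], row[n_feat + c] = 1.0, -1.0   # residual depends on x_f - x_c only
        rows.append(row)
J = np.array(rows)
H = J.T @ J                                   # Gauss-Newton Hessian, features first

H_diag = np.zeros_like(H)                     # keep only the two diagonal blocks
H_diag[:n_feat, :n_feat] = H[:n_feat, :n_feat]
H_diag[n_feat:, n_feat:] = H[n_feat:, n_feat:]

gauge_direction = np.ones(n_feat + n_cam)     # shift everything by the same amount
print(np.linalg.cond(H_diag))                 # small: the diagonal blocks look rigid
print(np.linalg.eigvalsh(H)[0])               # numerically zero: H itself is singular
print(np.linalg.norm(H @ gauge_direction))    # the common shift costs nothing
```

The diagonal blocks alone suggest a stiff, well-determined network, yet the off-diagonal blocks cancel them exactly along the common-shift direction.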

Line search: The above behaviour can often be improved significantly by adding a line
search to the method. In principle, the resulting method converges for any positive definite
$H_a$. However, accurate modelling of $H$ is still highly desirable. Even with no rounding
errors, an exactly quadratic (but otherwise unknown) cost function and exact line searches
(i.e. the minimum along the line is found exactly), the most efficient generic line search
based methods such as conjugate gradient or quasi-Newton require at least $O(n_x)$ iterations
to converge. For large bundle problems with thousands of parameters, this can already be
prohibitive. However, if knowledge about H is incorporated via a suitable preconditioner, 
the number of iterations can often be reduced substantially. 



7.3 Preconditioning 

Gradient descent and Krylov methods are sensitive to the coordinate system and their
practical success depends critically on good preconditioning. The aim is to find a linear
transformation $x \to T x$ and hence $g \to T^{-\top} g$ and $H \to T^{-\top} H\, T^{-1}$ for which the transformed
$H$ is near 1, or at least has only a few clusters of eigenvalues well separated from
the origin. Ideally, $T$ should be an accurate, low-cost approximation to the left Cholesky
factor of $H$. (Exactly evaluating this would give the expensive Newton method again). In
the experiments below, we tried conjugate gradient with preconditioners based on the diagonal
blocks of $H$, and on partial Cholesky decomposition, dropping either all filled-in
elements, or all that are smaller than a preset size when performing Cholesky decomposition.
These methods were not competitive with the exact Gauss-Newton ones in the ‘strip’
experiments below, but for large enough problems it is likely that a preconditioned Krylov
method would predominate, especially if more effective preconditioners could be found.

An exact Cholesky factor of H from a previous iteration is often a quite effective 
preconditioner. This gives hybrid methods in which H is only evaluated and factored every 
few iterations, with the Newton step at these iterations and well-preconditioned gradient 
descent or conjugate gradient at the others. 
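
A preconditioned conjugate gradient solver for the Newton equations might look like the sketch below. It is a generic textbook formulation, not the code used in the experiments: the preconditioner is passed as a function applying an approximate inverse of $H$, and the example uses the plain diagonal of $H$ (a simpler stand-in for the block-diagonal preconditioner mentioned above) on a synthetic, badly scaled test matrix.

```python
import numpy as np

def preconditioned_cg(H, g, apply_Minv, tol=1e-8, max_iter=1000):
    """Solve H dx = -g by conjugate gradient, where apply_Minv(r) applies an
    approximate inverse of H to a residual. Returns dx and the step count."""
    x = np.zeros(len(g))
    r = -np.asarray(g, dtype=float)      # residual of H dx = -g at dx = 0
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Hp = H @ p
        alpha = rz / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) < tol:
            return x, k
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

rng = np.random.default_rng(2)
n = 40
S = rng.uniform(-1.0, 1.0, (n, n))
core = np.eye(n) + 0.01 * (S + S.T) / 2      # well-conditioned core
scale = np.sqrt(np.logspace(0, 3, n))        # badly scaled coordinates
H = scale[:, None] * core * scale[None, :]
g = rng.standard_normal(n)

diag = np.diag(H).copy()
dx_plain, it_plain = preconditioned_cg(H, g, lambda r: r)          # no preconditioner
dx_jacobi, it_jacobi = preconditioned_cg(H, g, lambda r: r / diag)
print(it_plain, it_jacobi)
```

Here the poor conditioning comes entirely from the scaling, so even a diagonal preconditioner clusters the eigenvalues and cuts the iteration count. For bundle problems, where the trouble lies in the off-diagonal coupling, such simple preconditioners help much less.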

7.4 Experiments 

Figure 6 shows the relative performance of several methods on two synthetic projective 
bundle adjustment problems. In both cases, the number of 3D points increases in proportion 
to the number of images, so the dense factorization time is $O(n^3)$ where $n$ is the number
of points or images. The following methods are shown: ‘Sparse Gauss-Newton’ — sparse 
Cholesky decomposition with variables ordered naturally (features then cameras) ; ‘Dense 
Gauss-Newton’ — the same, but (inefficiently) ignoring all sparsity of the Hessian; ‘Diag. 
Conj. Gradient’ — the Newton step is found by an iterative conjugate gradient linear 
system solver, preconditioned using the Cholesky factors of the diagonal blocks of the 
Hessian; ‘Resect-Intersect’ — the state is optimized by alternate steps of resection and 
intersection, with relinearization after each. In the ‘spherical cloud’ problem, the points 
are uniformly distributed within a spherical cloud, all points are visible in all images, 
and the camera geometry is strongly convergent. These are ideal conditions, giving a low 
diameter network graph and a well-conditioned, nearly diagonal-dominant Hessian. All 
of the methods converge quite rapidly. Resection-intersection is a competitive method for
larger problems owing to its low cost per iteration. Unfortunately, although this geometry
is often used for testing computer vision algorithms, it is atypical for large-scale bundle
problems. The ‘strip’ experiment has a more representative geometry. The images are
arranged in a long strip, with each feature seen in about 3 overlapping images. The strip’s
long thin weakly-connected network structure gives it large scale low stiffness ‘flexing’
modes, with correspondingly poor Hessian conditioning. The off-diagonal terms are critical
here, so the approximate methods perform very poorly. Resection-intersection is slower
even than dense Cholesky decomposition ignoring all sparsity. For 16 or more images
it fails to converge even after 3000 iterations. The sparse Cholesky methods continue to
perform reasonably well, with the natural, minimum degree and reverse Cuthill-McKee
orderings all giving very similar run times in this case. For all of the methods that we
tested, including resection-intersection with its linear per-iteration cost, the total run time
for long chain-like geometries scaled roughly as $O(n^3)$.

[Figure 6: two plots, ‘Computation vs. Bundle Size -- Strong Geometry’ and ‘Computation vs. Bundle Size -- Weak Geometry’; horizontal axis: no. of images.]

Fig. 6. Relative speeds of various bundle optimization methods for strong ‘spherical cloud’ and weak
‘strip’ geometries.



8 Implementation Strategy 3: Updating and Recursion

8.1 Updating Rules 

It is often convenient to be able to update a state estimate to reflect various types of 
changes, e.g. to incorporate new observations or to delete erroneous ones (‘downdating’). 
Parameters may have to be added or deleted too. Updating rules are often used recursively,
to incorporate a series of observations one-by-one rather than solving a single batch system. 
This is useful in on-line applications where a rapid response is needed, and also to provide 
preliminary predictions, e.g. for correspondence searches. Much of the early development 
of updating methods was aimed at on-line data editing in aerial cartography workstations. 

The main challenge in adding or deleting observations is efficiently updating either a
factorization of the Hessian $H$, or the covariance $H^{-1}$. Given either of these, the state update
$\delta x$ is easily found by solving the Newton step equations $H\, \delta x = -g$, where (assuming
that we started at an un-updated optimum $g = 0$) the gradient $g$ depends only on the newly
added terms. The Hessian update $H \to H \pm B\, W B^\top$ needs to have relatively low rank,
otherwise nothing is saved over recomputing the batch solution. In least squares the rank is
the number of independent observations added or deleted, but even without this the rank is 
often low in bundle problems because relatively few parameters are affected by any given 
observation. 
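
For a direct covariance update, the low-rank change $H \to H \pm B\,W B^\top$ can be folded into $H^{-1}$ with the standard Woodbury identity. The sketch below assumes this identity is what formula (18) refers to; the function name, the dense random test data and the two-observation update are illustrative only.

```python
import numpy as np

def woodbury_update(C, B, W):
    """Given C = H^{-1}, return (H + B W B^T)^{-1} without refactoring H.
    Pass -W to remove (downdate) the same observations."""
    CB = C @ B
    K = np.linalg.inv(W) + B.T @ CB          # small (rank x rank) system
    return C - CB @ np.linalg.solve(K, CB.T)

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
H = M @ M.T + 6.0 * np.eye(6)
C = np.linalg.inv(H)                         # covariance before the update

B = rng.standard_normal((6, 2))              # two new observations (rank 2)
W = np.diag([2.0, 0.5])                      # their weights
C_new = woodbury_update(C, B, W)

g = B @ W @ rng.standard_normal(2)           # gradient from the new terms only
dx = -C_new @ g                              # state update: H_new dx = -g
C_back = woodbury_update(C_new, B, -W)       # downdate recovers the old covariance
```

Only a system of the size of the update rank is solved, but the covariance itself is dense, which is why this approach scales poorly when there are many features.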

One limitation of updating is that it is seldom as accurate as a batch solution owing to 
build-up of round-off error. Updating (adding observations) itself is numerically stable, but 
downdating (deleting observations) is potentially ill-conditioned as it reduces the positivity 
of the Hessian, and may cause previously good pivot choices to become arbitrarily bad. 
This is particularly a problem if all observations relating to a parameter are deleted, or 
if there are repeated insertion-deletion cycles as in time window filtering. Factorization 
updating methods are stabler than Woodbury formula / covariance updating ones. 

Consider first the case where no parameters need be added nor deleted, e.g. adding or 
deleting an observation of an existing point in an existing image. Several methods have been 
suggested [54, 66]. Mikhail & Helmering [88] use the Woodbury formula (18) to update
the covariance $H^{-1}$. This simple approach becomes inefficient for problems with many
features because the sparse structure is not exploited: the full covariance matrix is dense
and we would normally avoid calculating it in its entirety. Grün [51, 54] avoids this problem
by maintaining a running copy of the reduced camera system (20), using an incremental 
Schur complement / forward substitution (16) to fold each new observation into this, and 
then re-factorizing and solving as usual after each update. This is effective when there are 
many features in a few images, but for larger numbers of images it becomes inefficient 
owing to the re-factorization step. Factorization updating methods such as (55,56) are 
currently the recommended update methods for most applications : they allow the existing 
factorization to be exploited, they handle any number of images and features and arbitrary 
problem structure efficiently, and they are numerically more accurate than Woodbury 
formula methods. The Givens rotation method [12,54], which is equivalent to the rank 
1 Cholesky update (56), is probably the most common such method. The other updating 
methods are confusingly named in the literature. Mikhail & Helmering’s method [88] 
is sometimes called ‘Kalman filtering’, even though no dynamics and hence no actual 
filtering is involved. Grün’s reduced camera system method [51] is called ‘triangular factor
update (TFU)’, even though it actually updates the (square) reduced Hessian rather than
its triangular factors.

For updates involving a previously unseen 3D feature or image, new variables must
also be added to the system. This is easy. We simply choose where to put the variables in
the elimination sequence, and extend $H$ and its $L$, $D$, $U$ factors with the corresponding rows
and columns, setting all of the newly created positions to zero (except for the unit diagonals
of $LDL^\top$’s and $LU$’s $L$ factor). The factorization can then be updated as usual, presumably
adding enough cost terms to make the extended Hessian nonsingular and couple the new
parameters into the old network. If a direct covariance update is needed, the Woodbury
formula (18) can be used on the old part of the matrix, then (17) to fill in the new blocks
(equivalently, invert (54), with $D_1 \leftarrow A$ representing the old blocks and $D_2 \leftarrow 0$ the new
ones).
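
The factorization update referred to above (the rank 1 Cholesky update, to which the Givens rotation method is equivalent) can be sketched as follows. This is the common textbook algorithm for a dense lower-triangular factor and is not necessarily written in the same form as equation (56); the function name and test data are assumptions.

```python
import numpy as np

def cholesky_rank1_update(L, x):
    """Return the lower-triangular factor of L L^T + x x^T.
    Costs O(n^2), versus O(n^3) for refactoring from scratch."""
    L = L.copy()
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    for k in range(n):
        r = np.hypot(L[k, k], x[k])          # new diagonal entry
        c, s = r / L[k, k], x[k] / L[k, k]   # rotation-like coefficients
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5.0 * np.eye(5)
L = np.linalg.cholesky(H)                    # existing factorization
j = rng.standard_normal(5)                   # weighted Jacobian row of one new observation
L_new = cholesky_rank1_update(L, j)
```

Each column of the factor is touched once, so a single new scalar observation is folded into the existing factorization without forming or refactoring the updated Hessian. A sparse implementation would only visit the columns that the observation actually affects.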

Conversely, it may be necessary to delete parameters, e.g. if an image or 3D feature 
has lost most or all of its support. The corresponding rows and columns of the Hessian 
H (and rows of g, columns of J) must be deleted, and all cost contributions involving the 
deleted parameters must also be removed using the usual factorization downdates (55, 56). 
To delete the rows and columns of block $b$ in a matrix $A$, we first delete the $b$ rows and
columns of $L$, $D$, $U$. This maintains triangularity and gives the correct trimmed $A$, except
that the blocks in the lower right corner $A_{ij} = \sum_{k \le \min(i,j)} L_{ik} D_k U_{kj}$, $i, j > b$ are
missing a term $L_{ib} D_b U_{bj}$ from the deleted column $b$ of $L$ / row $b$ of $U$. This is added using
an update $+L_{*b} D_b U_{b*}$, $* > b$. To update $A^{-1}$ when rows and columns of $A$ are deleted,
permute the deleted rows and columns to the end and use (17) backwards:
$(A_{11})^{-1} = (A^{-1})_{11} - (A^{-1})_{12}\, (A^{-1})_{22}^{-1}\, (A^{-1})_{21}$.
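
The covariance rule for deleted parameters is a direct application of the block inverse identity and is easy to check numerically. In the sketch below the function name and the index lists are illustrative, and $A$ is assumed symmetric so that $(A^{-1})_{21}$ is the transpose of $(A^{-1})_{12}$.

```python
import numpy as np

def inverse_after_deletion(A_inv, keep, drop):
    """Inverse of A[keep, keep], computed from A_inv alone (A symmetric)."""
    C11 = A_inv[np.ix_(keep, keep)]
    C12 = A_inv[np.ix_(keep, drop)]
    C22 = A_inv[np.ix_(drop, drop)]
    return C11 - C12 @ np.linalg.solve(C22, C12.T)

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6.0 * np.eye(6)
keep, drop = [0, 1, 3, 5], [2, 4]            # delete parameters 2 and 4
trimmed_inv = inverse_after_deletion(np.linalg.inv(A), keep, drop)
```

Selecting index lists plays the role of permuting the deleted rows and columns to the end; only a system of the size of the deleted block has to be solved.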

It is also possible to freeze some live parameters at fixed (current or default) values, 
or to add extra parameters / unfreeze some previously frozen ones, c.f. (48, 49) below. In 
this case, rows and columns corresponding to the frozen parameters must be deleted or 
added, but no other change to the cost function is required. Deletion is as above. To insert 
rows and columns $A_{b*}$, $A_{*b}$ at block $b$ of matrix $A$, we open space in row and column $b$ of
$L$, $D$, $U$ and fill these positions with the usual recursively defined values (51). For $i, j > b$,
the sum (51) will now have a contribution $L_{ib} D_b U_{bj}$ that it should not have, so to correct
this we downdate the lower right submatrix $* > b$ with a cost cancelling contribution
$-L_{*b} D_b U_{b*}$.



8.2 Recursive Methods and Reduction 

Each update computation is roughly quadratic in the size of the state vector, so if new 
features and images are continually added the situation will eventually become unman- 
ageable. We must limit what we compute. In principle parameter refinement never stops: 
each observation update affects all components of the state estimate and its covariance. 
However, the refinements are in a sense trivial for parameters that are not directly coupled 
to the observation. If these parameters are eliminated using reduction (19), the observation
update can be applied directly to the reduced Hessian and gradient**. The eliminated
parameters can then be updated by simple back-substitution (19) and their covariances by 
(17). In particular, if we cease to receive new information relating to a block of parameters 
(an image that has been fully treated, a 3D feature that has become invisible), they and 
all the observations relating to them can be subsumed once-and-for-all in a reduced Hessian
and gradient on the remaining parameters. If required, we can later re-estimate the
eliminated parameters by back-substitution. Otherwise, we do not need to consider them
further.

** In (19), only $D$ and $b_2$ are affected by the observation as it is independent of the subsumed
components $A$, $B$, $C$, $b_1$. So applying the update to the reduced $\bar{D}$, $\bar{b}_2$ has the same
effect as applying it to $D$, $b_2$.

This elimination process has some limitations. Only ‘dead’ parameters can be elim- 
inated: to merge a new observation into the problem, we need the current Hessian or 
factorization entries for all parameter blocks relating to it. Reduction also commits us to 
a linearized / quadratic cost approximation for the eliminated variables at their current 
estimates, although to the extent that this model is correct, the remaining variables can still 
be treated nonlinearly. It is perhaps best to view reduction as the first half-iteration of a full
nonlinear optimization: by (19), the Newton method for the full model can be implemented
by repeated cycles of reduction, solving the reduced system, and back-substitution, with
relinearization after each cycle, whereas for eliminated variables we stop after solving the
first reduced system. Equivalently, reduction evaluates just the reduced components of the
full Newton step and the full covariance, leaving us the option of computing the remaining 
eliminated ones later if we wish. 
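
The reduce / update / back-substitute cycle described in this section can be sketched on a small dense system. The block names $A$, $B$, $C$, $D$, $b_1$, $b_2$ follow the usage in the text, but the exact layout of (19) is assumed here to be the usual Schur complement form, and the function names and test data are our own.

```python
import numpy as np

def reduce_system(A, B, C, D, b1, b2):
    """Eliminate x1 from [[A, B], [C, D]] [x1; x2] = [b1; b2]."""
    D_bar = D - C @ np.linalg.solve(A, B)
    b2_bar = b2 - C @ np.linalg.solve(A, b1)
    return D_bar, b2_bar

def back_substitute(A, B, b1, x2):
    """Recover the eliminated x1 once x2 is known."""
    return np.linalg.solve(A, b1 - B @ x2)

rng = np.random.default_rng(6)
M = rng.standard_normal((7, 7))
H = M @ M.T + 7.0 * np.eye(7)
b = rng.standard_normal(7)
A, B, C, D = H[:4, :4], H[:4, 4:], H[4:, :4], H[4:, 4:]
D_bar, b2_bar = reduce_system(A, B, C, D, b[:4], b[4:])

# A new observation touching only the retained block changes D and b2 ...
j = rng.standard_normal(3)
dD, db2 = np.outer(j, j), 0.3 * j
x2 = np.linalg.solve(D_bar + dD, b2_bar + db2)   # ... so apply it to the reduced system
x1 = back_substitute(A, B, b[:4], x2)            # eliminated block, only if wanted
```

Because the new observation does not involve the eliminated block, adding its contribution to the reduced system gives the same answer as updating the full system and solving it from scratch.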

Reduction can be used to refine estimates of relative camera poses (or fundamental
matrices, etc.) for a fixed set of images, by reducing a sequence of feature correspondences
to their camera coordinates. Or conversely, to refine 3D structure estimates for a fixed set
of features in many images, by reducing onto the feature coordinates.

Reduction is also the basis of recursive (Kalman) filtering. In this case, one has a (e.g.
time) series of system state vectors linked by some probabilistic transition rule (‘dynamical
model’), for which we also have some observations (‘observation model’). The parameter
space consists of the combined state vectors for all times, i.e. it represents a path through
the states. Both the dynamical and the observation models provide “observations” in the
sense of probabilistic constraints on the full state parameters, and we seek a maximum likelihood
(or similar) parameter estimate / path through the states. The full Hessian is block
tridiagonal: the observations couple only to the current state and give the diagonal blocks,
and dynamics couples only to the previous and next ones and gives the off-diagonal blocks
(differential observations can also be included in the dynamics likelihood). So the model
is large (if there are many time steps) but very sparse. As always with a tridiagonal matrix,
the Hessian can be decomposed by recursive steps of reduction, at each step Schur complementing
to get the current reduced block $\bar{H}_t$ from the previous one $\bar{H}_{t-1}$, the off-diagonal
(dynamical) coupling $H_{t\,t-1}$ and the current unreduced block (observation Hessian) $H_t$:
$\bar{H}_t = H_t - H_{t\,t-1}\, \bar{H}_{t-1}^{-1}\, H_{t\,t-1}^\top$. Similarly, for the gradient
$\bar{g}_t = g_t - H_{t\,t-1}\, \bar{H}_{t-1}^{-1}\, \bar{g}_{t-1}$,
and as usual the reduced state update is $\delta x_t = -\bar{H}_t^{-1}\, \bar{g}_t$.

This forwards reduction process is called filtering. At each time step it finds the optimal 
(linearized) current state estimate given all of the previous observations and dynamics. The 
corresponding unwinding of the recursion by back-substitution, smoothing, finds the opti- 
mal state estimate at each time given both past and future observations and dynamics. The 
usual equations of Kalman filtering and smoothing are easily derived from this recursion,
but we will not do this here. We emphasize that filtering is merely the first half-iteration of
a nonlinear optimization procedure: even for nonlinear dynamics and observation models,
we can find the exact maximum likelihood state path by cyclic passes of filtering and
smoothing, with relinearization after each.
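
One linearized pass of filtering followed by smoothing is just block forward elimination and back-substitution on the block tridiagonal Newton system. The sketch below illustrates this on random blocks; the function name, the list-based block storage and the test data are assumptions, and no attempt is made to reproduce the usual Kalman gain form of the equations.

```python
import numpy as np

def filter_and_smooth(H_diag, H_sub, g):
    """Solve the block tridiagonal system H dx = -g.
    H_diag[t] is the diagonal block H_t, H_sub[t] the coupling H_{t,t-1}
    (H_sub[0] is unused), and g[t] the gradient block for time t."""
    T = len(H_diag)
    H_bar, g_bar = [H_diag[0]], [g[0]]
    for t in range(1, T):                        # forward pass: filtering
        K = H_sub[t] @ np.linalg.inv(H_bar[t - 1])
        H_bar.append(H_diag[t] - K @ H_sub[t].T)
        g_bar.append(g[t] - K @ g_bar[t - 1])
    dx = [None] * T
    dx[T - 1] = -np.linalg.solve(H_bar[T - 1], g_bar[T - 1])   # filtered final state
    for t in range(T - 2, -1, -1):               # backward pass: smoothing
        dx[t] = -np.linalg.solve(H_bar[t], g_bar[t] + H_sub[t + 1].T @ dx[t + 1])
    return dx

rng = np.random.default_rng(7)
T, d = 5, 2
H_diag = [5.0 * np.eye(d) for _ in range(T)]
H_sub = [None] + [rng.uniform(-0.5, 0.5, (d, d)) for _ in range(T - 1)]
g = [rng.standard_normal(d) for _ in range(T)]
dx = filter_and_smooth(H_diag, H_sub, g)
```

The forward pass keeps only one reduced block at a time, so its cost is linear in the number of time steps. After the backward pass the blocks of `dx` together form the full Newton step for the whole state path.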

For long or unbounded sequences it may not be feasible to run the full iteration, but
it can still be very helpful to run short sections of it, e.g. smoothing back over the last
3-4 state estimates then filtering forwards again, to verify previous correspondences and
anneal the effects of nonlinearities. (The traditional extended Kalman filter optimizes
nonlinearly over just the current state, assuming all previous ones to be linearized). The
effects of variable window size on the Variable State Dimension Filter (VSDF) sequential
bundle algorithm [85, 86, 83, 84] are shown in figure 7.

[Figure 7: plot ‘Reconstruction Error vs. Time Window Size’; horizontal axis: time window size.]

Fig. 7. The residual state estimation error of the VSDF sequential bundle algorithm for progressively
increasing sizes of rolling time window. The residual error at image t = 16 is shown for rolling
windows of 1-5 previous images, and also for a ‘batch’ method (all previous images) and a ‘simple’
one (reconstruction / intersection is performed independently of camera location / resection). To
simulate the effects of decreasing amounts of image data, 0%, 15% and 70% of the image measurements
are randomly deleted to make runs with 100%, 85% and only 30% of the supplied image data.
The main conclusion is that window size has little effect for strong data, but becomes increasingly
important as the data becomes weaker.



9 Gauge Freedom 

Coordinates are a very convenient device for reducing geometry to algebra, but they come 
at the price of some arbitrariness. The coordinate system can be changed at any time, 
without affecting the underlying geometry. This is very familiar, but it leaves us with two 
problems: (i) algorithmically, we need some concrete way of deciding which particular
coordinate system to use at each moment, and hence breaking the arbitrariness; (ii) we
need to allow for the fact that the results may look quite different under different choices, 
even though they represent the same underlying geometry. 

Consider the choice of 3D coordinates in visual reconstruction. The only objects in the 
3D space are the reconstructed cameras and features, so we have to decide where to place 
the coordinate system relative to these ... Or in coordinate-centred language, where to 
place the reconstruction relative to the coordinate system. Moreover, bundle adjustment 
updates and uncertainties can perturb the reconstructed structure almost arbitrarily, so 
we must specify coordinate systems not just for the current structure, but also for every 
possible nearby one. Ultimately, this comes down to constraining the coordinate values 
of certain aspects of the reconstructed structure — features, cameras or combinations of 
these — whatever the rest of the structure might be. Saying this more intrinsically, the 
coordinate frame is specified and held fixed with respect to the chosen reference elements,
and the rest of the geometry is then expressed in this frame as usual. In measurement
science such a set of coordinate system specifying rules is called a datum, but we will
follow the wider mathematics and physics usage and call it a gauge. The freedom in the
choice of coordinate fixing rules is called gauge freedom.

As a gauge anchors the coordinate system rigidly to its chosen reference elements, per- 
turbing the reference elements has no effect on their own coordinates. Instead, it changes 
the coordinate system itself and hence systematically changes the coordinates of all the 
other features, while leaving the reference coordinates fixed. Similarly, uncertainties in 
the reference elements do not affect their own coordinates, but appear as highly correlated 
uncertainties in all of the other reconstructed features. The moral is that structural pertur- 
bations and uncertainties are highly relative. Their form depends profoundly on the gauge, 
and especially on how this changes as the state varies (i.e. which elements it holds fixed).
The effects of disturbances are not restricted to the coordinates of the features actually 
disturbed, but may appear almost anywhere depending on the gauge. 

In visual reconstruction, the differences between object-centred and camera-centred 
gauges are often particularly marked. In object-centred gauges, object points appear to be 
relatively certain while cameras appear to have large and highly correlated uncertainties. 
In camera-centred gauges, it is the camera that appears to be precise and the object points 
that appear to have large correlated uncertainties. One often sees statements like “the 
reconstructed depths are very uncertain”. This may be true in the camera frame, yet the 
object may be very well reconstructed in its own frame — it all depends on what fraction 
of the total depth fluctuations are simply due to global uncertainty in the camera location, 
and hence identical for all object points. 

Besides 3D coordinates, many other types of geometric parametrization in vision in- 
volve arbitrary choices, and hence are subject to gauge freedoms [106]. These include the 
choice of: homogeneous scale factors in homogeneous-projective representations;
supporting points in supporting-point based representations of lines and planes; reference
plane in plane + parallax representations; and homographies in homography-epipole
representations of matching tensors. In each case the symptoms and the remedies are the
same. 

9.1 General Formulation 

The general set up is as follows : We take as our state vector X the set of all of the 3D feature 
coordinates, camera poses and calibrations, etc., that enter the problem. This state space 
has internal symmetries related to the arbitrary choices of 3D coordinates, homogeneous 
scale factors, etc., that are embedded in X. Any two state vectors that differ only by such 
choices represent the same underlying 3D geometry, and hence have exactly the same image 
projections and other intrinsic properties. So under change-of-coordinates equivalence, the 
state space is partitioned into classes of intrinsically equivalent state vectors, each class 
representing exactly one underlying 3D geometry. These classes are called gauge orbits. 
Formally, they are the group orbits of the state space action of the relevant gauge group 
(coordinate transformation group), but we will not need the group structure below. A state 
space function represents an intrinsic function of the underlying geometry if and only if 
it is constant along each gauge orbit (i.e. coordinate system independent). Such quantities
are called gauge invariants. We want the bundle adjustment cost function to quantify
‘intrinsic merit’, so it must be chosen to be gauge invariant.

[Figure 8 omitted. Labels in the figure: “Gauge orbits foliate parameter space”; “Cost function is
constant along orbits”; “Gauge constraints fix coordinates for each nearby structure”; “Project
along orbits to change gauge”; “Covariance depends on chosen gauge”.]

Fig. 8. Gauge orbits in state space, two gauge cross-sections and their covariances.

Footnote (on the term ‘gauge’): Here, gauge just means reference frame. The sense is that of a
reference against which something is judged (O.Fr. jauger, gauger). Pronounce /ɡeɪdʒ/.

In visual reconstruction, the principal gauge groups are the 3 + 3 + 1 = 7 dimen- 
sional group of 3D similarity (scaled Euclidean) transformations for Euclidean recon- 
struction, and the 15 dimensional group of projective 3D coordinate transformations for 
projective reconstruction. But other gauge freedoms are also present. Examples include:
(i) The arbitrary scale factors of homogeneous projective feature representations, with
their 1D rescaling gauge groups. (ii) The arbitrary positions of the points in ‘two point’
line parametrizations, with their two 1D motion-along-line groups. (iii) The underspecified
3 × 3 homographies used for ‘homography + epipole’ parametrizations of matching tensors
[77, 62, 106]. For example, the fundamental matrix can be parametrized as F = [e]_× H,
where e is its left epipole and H is the inter-image homography induced by any 3D plane.
The choice of plane gives a freedom H → H + e a^T where a is an arbitrary 3-vector, and
hence a 3D linear gauge group.
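
The homography + epipole freedom can be checked numerically. The following is a minimal sketch with random illustrative values (the helper name `skew` and all numbers are assumptions made for the example, not part of the original text); it confirms that replacing H by H + e a^T leaves F = [e]_× H unchanged, because [e]_× e = 0.

```python
import numpy as np

def skew(e):
    """Return the 3x3 cross-product matrix [e]_x, so that skew(e) @ v == np.cross(e, v)."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

rng = np.random.default_rng(0)
e = rng.standard_normal(3)        # epipole (illustrative values)
H = rng.standard_normal((3, 3))   # homography induced by some 3D plane
a = rng.standard_normal(3)        # arbitrary 3-vector: the gauge parameter

F_original = skew(e) @ H
F_regauged = skew(e) @ (H + np.outer(e, a))   # H -> H + e a^T

# Largest entry-wise change in F caused by the gauge motion (zero up to rounding).
gauge_change = np.max(np.abs(F_original - F_regauged))
print(gauge_change)
```

The three components of a are exactly the three gauge degrees of freedom mentioned above: they move H along its orbit without changing the matching tensor.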

Now consider how to specify a gauge, i.e. a rule saying how each possible underlying 
geometry near the current one should be expressed in coordinates. Coordinatizations are 
represented by state space points, so this is a matter of choosing exactly one point (structure 
coordinatization) from each gauge orbit (underlying geometry). Mathematically, the gauge 
orbits foliate (fill without crossing) the state space, and a gauge is a local transversal 
‘cross-section’ Q through this foliation. See fig. 8. Different gauges represent different but
geometrically equivalent coordinatization rules. Results can be mapped between gauges 
by pushing them along gauge orbits, i.e. by applying local coordinate transformations that 
vary depending on the particular structure involved. Such transformations are called S- 
transforms (‘similarity’ transforms) [6, 107,22,25]. Different gauges through the same 
central state represent coordinatization rules that agree for the central geometry but differ 
for perturbed ones — the S-transform is the identity at the centre but not elsewhere. 

Given a gauge, only state perturbations that lie within the gauge cross-section are autho- 
rized. This is what we want, as such state perturbations are in one-to-one correspondence 
with perturbations of the underlying geometry. Indeed, any state perturbation is equivalent 
to some on-gauge one under the gauge group (i.e. under a small coordinate transformation 
that pushes the perturbed state along its gauge orbit until it meets the gauge cross-section). 
State perturbations along the gauge orbits are uninteresting, because they do not change 
the underlying geometry at all. 
Covariances are averages of squared perturbations and must also be based on on-gauge
perturbations (they would be infinite if we permitted perturbations along the gauge orbits,
as there is no limit to these: they do not change the cost at all). So covariance matrices
are gauge-dependent and in fact represent ellipsoids tangent to the gauge cross-section at
the cost minimum. They can look very different for different gauges. But, as with states,
S-transforms map them between gauges by projecting along gauge orbits / state equivalence
classes.

Note that there is no intrinsic notion of orthogonality on state space, so it is meaningless 
to ask which state-space directions are ‘orthogonal’ to the gauge orbits. This would
involve deciding when two different structures have been “expressed in the same coordinate
system”, so every gauge believes its own cross section to be orthogonal and all others to 
be skewed. 



9.2 Gauge Constraints 

We will work near some point x of state space, perhaps a cost minimum or a running state
estimate. Let n_x be the dimension of x and n_g the dimension of the gauge orbits. Let f, g, H
be the cost function and its gradient and Hessian, and G be any n_x × n_g matrix whose
columns span the local gauge orbit directions at x. By the exact gauge invariance of f,
its gradient and Hessian vanish along orbit directions: g^T G = 0 and H G = 0. Note that
the gauged Hessian H is singular with (at least) rank deficiency n_g and null space G. This
is called gauge deficiency. Many numerical optimization routines assume nonsingular H,
and must be modified to work in gauge invariant problems. The singularity is an expression
of indifference: when we come to calculate state updates, any two updates ending on the
same gauge orbit are equivalent, both in terms of cost and in terms of the change in the
underlying geometry. All that we need is a method of telling the routine which particular
update to choose.

Gauge constraints are the most direct means of doing this. A gauge cross-section Q
can be specified in two ways: (i) constrained form: specify n_g local constraints d(x)
with d(x) = 0 for points on Q; (ii) parametric form: specify a function x(y) of n_x − n_g
independent local parameters y, with x = x(y) being the points of Q. For example, a
trivial gauge is one that simply freezes the values of n_g of the parameters in x (usually
feature or camera coordinates). In this case we can take d(x) to be the parameter freezing
constraints and y to be the remaining unfrozen parameters. Note that once the gauge is
fixed the problem is no longer gauge invariant: the whole purpose of d(x), x(y) is to
break the underlying gauge invariance.

Examples of trivial gauges include: (i) using several visible 3D points as a ‘projective
basis’ for reconstruction (i.e. fixing their projective 3D coordinates to simple values, as
in [27]); and (ii) fixing the components of one projective 3 × 4 camera matrix as (I 0),
as in [61] (this only partially fixes the 3D projective gauge: 3 projective 3D degrees of
freedom remain unfixed).



Footnote (on calculating G): A suitable G is easily calculated from the infinitesimal action of
the gauge group on x. For example, for spatial similarities the columns of G would be the
n_g = 3 + 3 + 1 = 7 state velocity vectors describing the effects of infinitesimal translations,
rotations and changes of spatial scale on x.
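
As an illustration of how G can be built from the infinitesimal group action, here is a minimal sketch for a state consisting only of n 3D points stacked into one vector. The function names, the toy cost and the finite-difference gradient are assumptions made for the example; a real bundle adjustment state would also contain cameras. The sketch checks that G has rank 7 and that the gradient of a similarity-invariant cost satisfies g^T G = 0.

```python
import numpy as np

def similarity_gauge_directions(points):
    """Columns of G for a state of n 3D points, stacked as x = points.ravel().

    Columns 0-2: infinitesimal translations along the coordinate axes.
    Columns 3-5: infinitesimal rotations about the axes (velocity w x p).
    Column 6:    infinitesimal change of scale (velocity p).
    """
    n = points.shape[0]
    G = np.zeros((3 * n, 7))
    axes = np.eye(3)
    for k in range(3):
        G[:, k] = np.tile(axes[k], n)
        G[:, 3 + k] = np.cross(axes[k], points).ravel()
    G[:, 6] = points.ravel()
    return G

def ratio_cost(x):
    """Toy similarity-invariant cost: squared log ratios of consecutive inter-point distances."""
    p = x.reshape(-1, 3)
    d = np.linalg.norm(p[1:] - p[:-1], axis=1)
    return 0.5 * np.sum(np.log(d[1:] / d[:-1]) ** 2)

def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

rng = np.random.default_rng(1)
points = rng.standard_normal((5, 3))
x = points.ravel()

G = similarity_gauge_directions(points)
g = numerical_gradient(ratio_cost, x)

gauge_rank = np.linalg.matrix_rank(G)        # dimension of the gauge orbit
orbit_slope = np.max(np.abs(g @ G))          # gradient component along the orbit
print(gauge_rank, orbit_slope, np.linalg.norm(g))
```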
Linearized gauge: Let the local linearizations of the gauge functions be:

    d(x + \delta x) \approx d(x) + D\,\delta x, \qquad D \equiv \frac{\partial d}{\partial x}    (28)

    x(y + \delta y) \approx x(y) + Y\,\delta y, \qquad Y \equiv \frac{\partial x}{\partial y}    (29)

Compatibility between the two gauge specification methods requires d(x(y)) = 0 for all
y, and hence D Y = 0. Also, since Q must be transversal to the gauge orbits, D G must
have full rank n_g and (Y G) must have full rank n_x. Assuming that x itself is on Q, a
perturbation x + δx_g is on Q to first order iff D δx_g = 0 or δx_g = Y δy for some δy.

Two n_x × n_x rank n_x − n_g matrices characterize Q. The gauge projection matrix P_g
implements linearized projection of state displacement vectors δx along their gauge orbits
onto the local gauge cross-section: δx → δx_g = P_g δx. (The projection is usually
non-orthogonal: P_g^T ≠ P_g.) The gauged covariance matrix V_g plays the role of the inverse
Hessian. It gives the cost-minimizing Newton step within Q, δx_g = −V_g g, and also
the asymptotic covariance of δx_g. P_g and V_g have many special properties and equivalent
forms. For convenience, we display some of these now: let V ≡ (H + D^T B D)^{-1} where
B is any nonsingular symmetric n_g × n_g matrix, and let Q' be any other gauge, with
projector P_g' and covariance V_g':

    V_g \equiv Y (Y^T H Y)^{-1} Y^T = V H V = V - G (D G)^{-1} B^{-1} (D G)^{-T} G^T    (30)

        = P_g V = P_g V_g = P_g V_{g'} P_g^T    (31)

    P_g \equiv 1 - G (D G)^{-1} D = Y (Y^T H Y)^{-1} Y^T H = V H = V_g H = P_g P_{g'}    (32)

    P_g G = 0, \qquad P_g Y = Y, \qquad D P_g = D V_g = 0    (33)

    g^T P_g = g^T, \qquad H P_g = H, \qquad V_g\, g = V g    (34)



These relations can be summarized by saying that V_g is the Q-supported generalized inverse
of H and that P_g: (i) projects along gauge orbits (P_g G = 0); (ii) projects onto the gauge
cross-section Q (D P_g = 0, P_g Y = Y, P_g δx = δx_g and V_g = P_g V_g' P_g^T); and (iii)
preserves gauge invariants (e.g. f(x + P_g δx) = f(x + δx), g^T P_g = g^T and H P_g = H).
Both V_g and H have rank n_x − n_g. Their null spaces D^T and G are transversal but otherwise
unrelated. P_g has left null space D and right null space G.
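
The identities (30)-(34) are easy to confirm on a small random problem. The sketch below is only a numerical sanity check under assumed toy data: the dimensions, the random G, H, D and the helper `null_space` are all illustrative choices, not part of the original method.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

rng = np.random.default_rng(2)
n_x, n_g = 6, 2

G = rng.standard_normal((n_x, n_g))             # gauge orbit directions
N = null_space(G.T)                             # directions orthogonal to G
A = rng.standard_normal((n_x - n_g, n_x - n_g))
H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T     # symmetric, H G = 0, rank n_x - n_g

D = rng.standard_normal((n_g, n_x))             # linearized gauge constraints
Y = null_space(D)                               # parametric form of the same gauge, D Y = 0
B = np.diag([2.0, 3.0])                         # any nonsingular symmetric weight

DG_inv = np.linalg.inv(D @ G)
P = np.eye(n_x) - G @ DG_inv @ D                # (32), first form
V_g = Y @ np.linalg.solve(Y.T @ H @ Y, Y.T)     # (30), first form
V = np.linalg.inv(H + D.T @ B @ D)

checks = {
    "V_g = V H V": np.allclose(V_g, V @ H @ V),
    "V_g = V - G (DG)^-1 B^-1 (DG)^-T G^T":
        np.allclose(V_g, V - G @ DG_inv @ np.linalg.inv(B) @ DG_inv.T @ G.T),
    "V_g = P V": np.allclose(V_g, P @ V),
    "P = V H = V_g H": np.allclose(P, V @ H) and np.allclose(P, V_g @ H),
    "P G = 0": np.allclose(P @ G, 0.0),
    "P Y = Y": np.allclose(P @ Y, Y),
    "D P = D V_g = 0": np.allclose(D @ P, 0.0) and np.allclose(D @ V_g, 0.0),
    "H P = H": np.allclose(H @ P, H),
}
for name, ok in checks.items():
    print(name, ok)
```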

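State updates: It is straightforward to add gauge fixing to the bundle update equations.
First consider the constrained form. Enforcing the gauge constraints d(x + δx_g) = 0 with
Lagrange multipliers λ gives an SQP step:

    \begin{pmatrix} H & D^T \\ D & 0 \end{pmatrix} \begin{pmatrix} \delta x_g \\ \lambda \end{pmatrix} = -\begin{pmatrix} g \\ d \end{pmatrix}, \qquad \begin{pmatrix} H & D^T \\ D & 0 \end{pmatrix}^{-1} = \begin{pmatrix} V_g & G (D G)^{-1} \\ (D G)^{-T} G^T & 0 \end{pmatrix}    (35)

    \text{so} \quad \delta x_g = -\left( V_g\, g + G (D G)^{-1} d \right), \qquad \lambda = 0    (36)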



This is a rather atypical constrained problem. For typical cost functions the gradient has a
component pointing away from the constraint surface, so g ≠ 0 at the constrained minimum
and a non-vanishing force λ ≠ 0 is required to hold the solution on the constraints. Here,
the cost function and its derivatives are entirely indifferent to motions along the orbits.
Nothing actively forces the state to move off the gauge, so the constraint force λ vanishes
everywhere, g vanishes at the optimum, and the constrained minimum value of f is identical
to the unconstrained minimum. The only effect of the constraints is to correct any gradual
drift away from Q that happens to occur, via the d term in δx_g.

Footnote (on the proof of (30)-(34)): These results are most easily proved by inserting
strategic factors of (Y G)(Y G)^{-1} and using H G = 0, D Y = 0 and

    (Y\ G)^{-1} = \begin{pmatrix} (Y^T H Y)^{-1} Y^T H \\ (D G)^{-1} D \end{pmatrix}.

For any n_g × n_g B including 0,

    \begin{pmatrix} Y^T \\ G^T \end{pmatrix} (H + D^T B D) (Y\ G) = \begin{pmatrix} Y^T H Y & 0 \\ 0 & (D G)^T B\, (D G) \end{pmatrix}.

If B is nonsingular,

    V = (H + D^T B D)^{-1} = Y (Y^T H Y)^{-1} Y^T + G (D G)^{-1} B^{-1} (D G)^{-T} G^T.

A simpler way to get the same effect is to add a gauge-invariance breaking term such
as ½ d(x)^T B d(x) to the cost function, where B is some positive n_g × n_g weight matrix.
Note that ½ d(x)^T B d(x) has a unique minimum of 0 on each orbit at the point d(x) = 0,
i.e. for x on Q. As f is constant along gauge orbits, optimization of f(x) + ½ d(x)^T B d(x)
along each orbit enforces Q and hence returns the orbit’s f value, so global optimization
will find the global constrained minimum of f. The cost function f(x) + ½ d(x)^T B d(x)
is nonsingular with Newton step δx_g = −V (g + D^T B d) where V = (H + D^T B D)^{-1} is
the new inverse Hessian. By (34, 30), this is identical to the SQP step (36), so the SQP
and cost-modifying methods are equivalent. This strategy works only because no force is
required to keep the state on-gauge: if this were not the case, the weight B would have
to be infinite. Also, for dense D this form is not practically useful because H + D^T B D is
dense and hence slow to factorize, although updating formulae can be used.
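
The equivalence of the SQP step and the cost-modifying step can be checked on toy data. In the sketch below every quantity (dimensions, random G, H, g, D, d, B, and the helpers `null_space` and `toy_problem`) is an illustrative assumption; it solves the bordered system (35) directly, evaluates the closed form (36), and compares both with the Newton step of the penalized cost.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

def toy_problem(rng, n_x, n_g):
    """Random orbit directions G, Hessian H with H G = 0, and gradient g with g^T G = 0."""
    G = rng.standard_normal((n_x, n_g))
    N = null_space(G.T)
    A = rng.standard_normal((n_x - n_g, n_x - n_g))
    H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T
    g = N @ rng.standard_normal(n_x - n_g)
    return G, H, g

rng = np.random.default_rng(3)
n_x, n_g = 6, 2
G, H, g = toy_problem(rng, n_x, n_g)
D = rng.standard_normal((n_g, n_x))      # linearized gauge constraints
d = rng.standard_normal(n_g)             # current constraint violation d(x)
C = rng.standard_normal((n_g, n_g))
B = C @ C.T + np.eye(n_g)                # positive definite penalty weight

# SQP step: solve the bordered system (35) directly.
K = np.block([[H, D.T], [D, np.zeros((n_g, n_g))]])
solution = np.linalg.solve(K, -np.concatenate([g, d]))
dx_sqp, lam = solution[:n_x], solution[n_x:]

# Closed form (36), with V_g taken from the parametric form in (30).
Y = null_space(D)
V_g = Y @ np.linalg.solve(Y.T @ H @ Y, Y.T)
dx_closed = -(V_g @ g + G @ np.linalg.solve(D @ G, d))

# Newton step of the penalized cost f + (1/2) d^T B d.
dx_penalty = -np.linalg.solve(H + D.T @ B @ D, g + D.T @ B @ d)

print(np.max(np.abs(dx_sqp - dx_closed)), np.max(np.abs(dx_sqp - dx_penalty)), lam)
```

The multiplier comes out as zero (up to rounding) for any choice of B, which is the numerical counterpart of the statement that no force is needed to keep the state on-gauge.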

Finally, consider the parametric form x = x(y) of Q. Suppose that we already have a
current reduced state estimate y. We can approximate f(x(y + δy)) to get a reduced system
for δy, solve this, and find δx_g afterwards if necessary:

    (Y^T H Y)\,\delta y = -Y^T g, \qquad \delta x_g = Y\,\delta y = -V_g\, g    (37)

The (n_x − n_g) × (n_x − n_g) matrix Y^T H Y is generically nonsingular despite the singularity
of H. In the case of a trivial gauge, Y simply selects the submatrices of g, H corresponding
to the unfrozen parameters, and solves for these. For less trivial gauges, both Y and D are
often dense and there is a risk that substantial fill-in will occur in all of the above methods.
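
For a trivial gauge the reduced system (37) amounts to deleting rows and columns. The following minimal sketch uses assumed toy data (random G, H, g, and an arbitrary choice to freeze the last n_g parameters); it also checks that the padded step solves the full, singular Newton system H δx = −g.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

def toy_problem(rng, n_x, n_g):
    """Random orbit directions G, Hessian H with H G = 0, and gradient g with g^T G = 0."""
    G = rng.standard_normal((n_x, n_g))
    N = null_space(G.T)
    A = rng.standard_normal((n_x - n_g, n_x - n_g))
    H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T
    g = N @ rng.standard_normal(n_x - n_g)
    return G, H, g

rng = np.random.default_rng(4)
n_x, n_g = 6, 2
G, H, g = toy_problem(rng, n_x, n_g)

free = np.arange(n_x - n_g)            # unfrozen parameters y
frozen = np.arange(n_x - n_g, n_x)     # parameters held fixed by the trivial gauge

# Reduced system (37): Y selects the free rows and columns of H and g.
H_reduced = H[np.ix_(free, free)]
dy = np.linalg.solve(H_reduced, -g[free])

# Pad with zeros for the frozen parameters to recover the full on-gauge step.
dx_g = np.zeros(n_x)
dx_g[free] = dy

# Although H is singular, the on-gauge step satisfies the full Newton equations.
newton_residual = H @ dx_g + g
print(np.linalg.matrix_rank(H), np.max(np.abs(newton_residual)))
```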

Gauged covariance: By (30) and standard covariance propagation in (37), the covariance
of the on-gauge fluctuations δx_g is E[δx_g δx_g^T] = Y (Y^T H Y)^{-1} Y^T = V_g. δx_g never
moves off Q, so V_g represents a rank n_x − n_g covariance ellipsoid ‘flattened onto Q’. In a
trivial gauge, V_g is the covariance (Y^T H Y)^{-1} of the free variables, padded with zeros for
the fixed ones.

Given V_g, the linearized gauged covariance of a function h(x) is (dh/dx) V_g (dh/dx)^T as usual.
If h(x) is gauge invariant (constant along gauge orbits) this is just its ordinary covariance.

Intuitively, V_g and (dh/dx) V_g (dh/dx)^T depend on the gauge because they measure not absolute
uncertainty, but uncertainty relative to the reference features on which the gauge is based. 
Just as there are no absolute reference frames, there are no absolute uncertainties. The best 
we can do is relative ones. 
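
The gauge dependence of V_g, and the gauge independence of the covariance of an invariant, can both be seen in a small experiment. The sketch below is built on assumed toy data: the two gauges (a trivial gauge and the W = 1 inner gauge), the random Jacobian J with J G = 0 standing in for dh/dx of some gauge-invariant h, and the helper names are all illustrative.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

def gauged_covariance(H, D):
    """V_g = Y (Y^T H Y)^{-1} Y^T for the gauge whose linearized constraints are D."""
    Y = null_space(D)
    return Y @ np.linalg.solve(Y.T @ H @ Y, Y.T)

rng = np.random.default_rng(5)
n_x, n_g = 6, 2
G = rng.standard_normal((n_x, n_g))
N = null_space(G.T)
A = rng.standard_normal((n_x - n_g, n_x - n_g))
H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T       # H G = 0

D_trivial = np.eye(n_x)[n_x - n_g:]               # freeze the last n_g parameters
D_inner = G.T                                     # inner constraints with W = 1
V_trivial = gauged_covariance(H, D_trivial)
V_inner = gauged_covariance(H, D_inner)

# Jacobian of a gauge-invariant function h: its rows are orthogonal to the orbits.
J = rng.standard_normal((3, n_x - n_g)) @ N.T
cov_h_trivial = J @ V_trivial @ J.T
cov_h_inner = J @ V_inner @ J.T

# S-transform of the covariance from the inner gauge to the trivial gauge.
P_trivial = np.eye(n_x) - G @ np.linalg.solve(D_trivial @ G, D_trivial)
V_mapped = P_trivial @ V_inner @ P_trivial.T

print(np.max(np.abs(V_trivial - V_inner)))          # the two gauged covariances differ
print(np.max(np.abs(cov_h_trivial - cov_h_inner)))  # the invariant's covariance does not
```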

Gauge transforms: We can change the gauge at will during a computation, e.g. to improve
sparseness or numerical conditioning or re-express results in some standard gauge. This
is simply a matter of an S-transform [6], i.e. pushing all gauged quantities along their
gauge orbits onto the new gauge cross-section Q. We will assume that the base point x
is unchanged. If not, a fixed (structure independent) change of coordinates achieves this.
Locally, an S-transform then linearizes into a linear projection along the orbits spanned by
G onto the new gauge constraints given by D or Y. This is implemented by the n_x × n_x, rank
n_x − n_g non-orthogonal projection matrix P_g defined in (32). The projection preserves all
gauge invariants (e.g. f(x + P_g δx) = f(x + δx)) and it cancels the effects of projection
onto any other gauge: P_g P_g' = P_g.

9.3 Inner Constraints 

Given the wide range of gauges and the significant impact that they have on the appearance 
of the state updates and covariance matrix, it is useful to ask which gauges give the 
“smallest” or “best behaved” updates and covariances. This is useful for interpreting and 
comparing results, and it also gives beneficial numerical properties. Basically it is a matter 
of deciding which features or cameras we care most about and tying the gauge to some 
stable average of them, so that gauge-induced correlations in them are as small as possible. 
For object reconstruction the resulting gauge will usually be object-centred, for vehicle 
navigation camera-centred. We stress that such choices are only a matter of superficial
appearance : in principle, all gauges are equivalent and give identical values and covariances 
for all gauge invariants. 

Another way to say this is that it is only for gauge invariants that we can find meaningful 
(coordinate system independent) values and covariances. But one of the most fruitful ways 
to create invariants is to locate features w.r.t. a basis of reference features, i.e. w.r.t. the 
gauge based on them. The choice of inner constraints is thus a choice of a stable basis 
of compound features w.r.t. which invariants can be measured. By including an average 
of many features in the compound, we reduce the invariants’ dependence on the basis 
features. 

As a performance criterion we can minimize some sort of weighted average size, either
of the state update or of the covariance. Let W be an n_x × n_x information-like weight matrix
encoding the relative importance of the various error components, and L be any left square
root for it, L L^T = W. The local gauge at x that minimizes the weighted size of the state
update δx_g^T W δx_g, the weighted covariance sum Trace(W V_g) = Trace(L^T V_g L), and the
L2 or Frobenius norm of L^T V_g L, is given by the inner constraints [87, 89, 6, 22, 25]:

    D\,\delta x = 0 \quad \text{where} \quad D \equiv G^T W    (38)

The corresponding covariance V_g is given by (30) with D = G^T W, and the state update is
δx_g = −V_g g as usual. Also, if W is nonsingular, V_g is given by the weighted rank n_x − n_g
pseudo-inverse L^{-T} (L^{-1} H L^{-T})^† L^{-1}, where W = L L^T is the Cholesky decomposition of
W and (·)^† is the Moore-Penrose pseudo-inverse.
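
The two descriptions of the inner-constraint covariance, and its minimal weighted trace, can be compared numerically. This is a sketch on assumed random data (G, H, W and the competing gauges are arbitrary illustrative choices); the explicit `rcond` passed to `np.linalg.pinv` is a tolerance chosen for this toy problem so that the n_g numerically zero singular values are discarded.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

def gauged_covariance(H, D):
    """V_g = Y (Y^T H Y)^{-1} Y^T for the gauge whose linearized constraints are D."""
    Y = null_space(D)
    return Y @ np.linalg.solve(Y.T @ H @ Y, Y.T)

rng = np.random.default_rng(6)
n_x, n_g = 6, 2
G = rng.standard_normal((n_x, n_g))
N = null_space(G.T)
A = rng.standard_normal((n_x - n_g, n_x - n_g))
H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T       # H G = 0

C = rng.standard_normal((n_x, n_x))
W = C @ C.T + np.eye(n_x)                         # nonsingular weight matrix
L = np.linalg.cholesky(W)                         # lower triangular, W = L L^T
L_inv = np.linalg.inv(L)

# Inner constraints (38): D = G^T W, covariance from (30).
V_inner = gauged_covariance(H, G.T @ W)

# The same covariance as a weighted pseudo-inverse.
V_pinv = L_inv.T @ np.linalg.pinv(L_inv @ H @ L_inv.T, rcond=1e-10) @ L_inv

# Weighted covariance sum for the inner gauge and for a few random competing gauges.
trace_inner = np.trace(W @ V_inner)
trace_others = [np.trace(W @ gauged_covariance(H, rng.standard_normal((n_g, n_x))))
                for _ in range(5)]

print(np.max(np.abs(V_inner - V_pinv)), trace_inner, min(trace_others))
```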

Footnote (sketch proof of the inner constraint result): For W = 1 (whence L = 1) and diagonal
H = \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix}, we have G = \begin{pmatrix} 0 \\ 1 \end{pmatrix}
and g = \begin{pmatrix} g' \\ 0 \end{pmatrix} as g^T G = 0. Any gauge Q transversal to G has the form
D = (−B C) with nonsingular C. Premultiplying by C^{-1} reduces D to the form D = (−B 1) for some
n_g × (n_x − n_g) matrix B. It follows that P_g = \begin{pmatrix} 1 & 0 \\ B & 0 \end{pmatrix} and
V_g = \begin{pmatrix} 1 \\ B \end{pmatrix} \Lambda^{-1} (1 \ B^T), whence
δx_g^T W δx_g = g^T V_g W V_g g = g'^T Λ^{-1} (1 + B^T B) Λ^{-1} g' and
Trace(V_g) = Trace(Λ^{-1}) + Trace(B Λ^{-1} B^T).
Both criteria are clearly minimized by taking B = 0, so D = (0 1) = G^T W as claimed. For
nonsingular W = L L^T, scaling the coordinates by x → L^T x reduces us to W → 1, g^T → g^T L^{-T}
and H → L^{-1} H L^{-T}. Eigen-decomposition then reduces us to diagonal H. Neither transformation
affects δx_g^T W δx_g or Trace(W V_g), and back substituting gives the general result. For singular
W, use a limiting argument on D = G^T W. Similarly, using V_g as above, B → 0, and hence the
inner constraint, minimizes the L2 and Frobenius norms of L^T V_g L. Indeed, by the interlacing
property of eigenvalues [44, §8.1], B → 0 minimizes any strictly non-decreasing rotationally
invariant function of L^T V_g L (i.e. any strictly non-decreasing function of its eigenvalues). □
The inner constraints are covariant under global transformations x → t(x) provided
that W is transformed in the usual information matrix / Hessian way W → T^{-T} W T^{-1}, where
T = dt/dx. However, such transformations seldom preserve the form of W (diagonality,
W = 1, etc.). If W represents an isotropic weighted sum over 3D points, its form is
preserved under global 3D Euclidean transformations, and rescaled under scalings. But
this extends neither to points under projective transformations, nor to camera poses, 3D
planes and other non-point-like features even under Euclidean ones. (The choice of origin
has a significant influence for poses, planes, etc.: changes of origin propagate rotational
uncertainties into translational ones.)

Inner constraints were originally introduced in geodesy in the case W = 1 [87]. The
meaning of this is entirely dependent on the chosen 3D coordinates and variable scaling.
In bundle adjustment there is little to recommend W = 1 unless the coordinate origin has
been carefully chosen and the variables carefully pre-scaled as above, i.e. x → L^T x and
hence H → L^{-1} H L^{-T}, where W ~ L L^T is a fixed weight matrix that takes account of the
fact that the covariances of features, camera translations and rotations, focal lengths, aspect
ratios and lens distortions, all have entirely different units, scales and relative importances.
For W = 1, the gauge projection P_g becomes orthogonal and symmetric.

9.4 Free Networks 

Gauges can be divided roughly into outer gauges, which are locked to predefined external 
reference features giving a fixed network adjustment, and inner gauges, which are locked 
only to the recovered structure giving a free network adjustment. (If their weight W is 
concentrated on the external reference, the inner constraints give an outer gauge). As 
above, well-chosen inner gauges do not distort the intrinsic covariance structure so much 
as most outer ones, so they tend to have better numerical conditioning and give a more 
representative idea of the true accuracy of the network. It is also useful to make another, 
slightly different fixed / free distinction. In order to control the gauge deficiency, any 
gauge fixing method must at least specify which motions are locally possible at each 
iteration. However, it is not indispensable for these local decisions to cohere to enforce 
a global gauge. A method is globally fixed if it does enforce a global gauge (whether 
inner or outer), and globally free if not. For example, the standard photogrammetric inner 
constraints [87, 89, 22, 25] give a globally free inner gauge. They require that the cloud of 
reconstructed points should not be translated, rotated or rescaled under perturbations (i.e. 
the centroid and average directions and distances from the centroid remain unchanged). 
However, they do not specify where the cloud actually is and how it is oriented and scaled, 
and they do not attempt to correct for any gradual drift in the position that may occur during 
the optimization iterations, e.g. owing to accumulation of truncation errors. In contrast, 
McLauchlan globally fixes the inner gauge by locking it to the reconstructed centroid 
and scatter matrix [82, 81]. This seems to give good numerical properties (although more 
testing is required to determine whether there is much improvement over a globally free
inner gauge), and it has the advantage of actually fixing the coordinate system so that
direct comparisons of solutions, covariances, etc., are possible. Numerically, a globally
fixed gauge can be implemented either by including the ‘d’ term in (36), or simply by
applying a rectifying gauge transformation to the estimate, at each step or when it drifts
too far from the chosen gauge.

Footnote (on the covariance of the inner constraints under x → t(x)): G → T G implies that
D → D T^{-1}, whence V_g → T V_g T^T, P_g → T P_g T^{-1}, and δx_g → T δx_g.
So δx_g^T W δx_g and Trace(W V_g) are preserved.

Footnote (on W being an isotropic weighted sum over 3D points): This means that it vanishes
identically for all non-point features, camera parameters, etc., and is a weighted identity
matrix W_i = w_i I_{3×3} for each 3D point, or more generally it has the form W ⊗ I_{3×3} on the
block of 3D point coordinates, where W is some n_points × n_points inter-point weighting matrix.

9.5 Implementation of Gauge Constraints 

Given that all gauges are in principle equivalent, it does not seem worthwhile to pay a 
high computational cost for gauge fixing during step prediction, so methods requiring 
large dense factorizations or (pseudo-)inverses should not be used directly. Instead, the 
main computation can be done in any convenient, low cost gauge, and the results later 
transformed into the desired gauge using the gauge projector P_g = 1 − G (D G)^{-1} D.
It is probably easiest to use a trivial gauge for the computation. This is simply a matter
of deleting the rows and columns of g, H corresponding to n_g preselected parameters,
which should be chosen to give a reasonably well-conditioned gauge. The choice can be
made automatically by a subset selection method (cf., e.g. [11]). H is left intact and
factored as usual, except that the final dense (owing to fill-in) submatrix is factored using a
stable pivoted method, and the factorization is stopped n_g columns before completion. The
remaining n_g × n_g block (and the corresponding block of the forward-substituted gradient
g) should be zero owing to gauge deficiency. The corresponding rows of the state update
are set to zero (or anything else that is wanted) and back-substitution gives the remaining
update components as usual. This method effectively finds the n_g parameters that are least
well constrained by the data, and chooses the gauge constraints that freeze these by setting
the corresponding δx_g components to zero.
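
The two-stage strategy (compute in a cheap trivial gauge, then transform) can be sketched as follows. The toy data, the choice of frozen parameters and the target gauge (inner constraints with W = 1) are assumptions made for illustration, and dense solves stand in for the sparse pivoted factorization described above. The n_x × n_x projector is never formed: only products with D and a small n_g × n_g solve are needed.

```python
import numpy as np

def null_space(A):
    """Orthonormal basis of the null space of a full-row-rank matrix A, from its SVD."""
    _, _, vt = np.linalg.svd(A)
    return vt[A.shape[0]:].T

rng = np.random.default_rng(7)
n_x, n_g = 8, 3
G = rng.standard_normal((n_x, n_g))
N = null_space(G.T)
A = rng.standard_normal((n_x - n_g, n_x - n_g))
H = N @ (A @ A.T + np.eye(n_x - n_g)) @ N.T       # H G = 0
g = N @ rng.standard_normal(n_x - n_g)            # g^T G = 0

# Stage 1: Newton step in a trivial gauge (freeze the last n_g parameters).
free = np.arange(n_x - n_g)
dx_trivial = np.zeros(n_x)
dx_trivial[free] = np.linalg.solve(H[np.ix_(free, free)], -g[free])

# Stage 2: apply P_g = 1 - G (D G)^{-1} D in pieces, for the target gauge D = G^T.
D = G.T
correction = np.linalg.solve(D @ G, D @ dx_trivial)   # small n_g x n_g solve
dx_inner = dx_trivial - G @ correction

# For W = 1 the inner-gauge step should equal the pseudo-inverse step.
dx_pinv = -np.linalg.pinv(H, rcond=1e-10) @ g
print(np.max(np.abs(dx_inner - dx_pinv)))
```

Both steps differ only by a motion along the gauge orbit, so they describe the same change in the underlying geometry.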



10 Quality Control 

This section discusses quality control methods for bundle adjustment, giving diagnostic 
tests that can be used to detect outliers and characterize the overall accuracy and reliability 
of the parameter estimates. These techniques are not well known in vision so we will go 
into some detail. Skip the technical details if you are not interested in them. 

Quality control is a serious issue in measurement science, and it is perhaps here that 
the philosophical differences between photogrammetrists and vision workers are greatest: 
the photogrammetrist insists on good equipment, careful project planning, exploitation 
of prior knowledge and thorough error analyses, while the vision researcher advocates a 
more casual, flexible ‘point-and-shoot’ approach with minimal prior assumptions. Many 
applications demand a judicious compromise between these virtues. 

A basic maxim is “quality = accuracy + reliability”. The absolute accuracy of the
system depends on the imaging geometry, number of measurements, etc. But theoretical 

Footnote (on the gauge projector): The projector P_g itself is never calculated. Instead, it is
applied in pieces, multiplying by D, etc. The gauged Newton step δx_g is easily found like this,
and selected blocks of the covariance V_g = P_g V_g' P_g^T can also be found in this way,
expanding P_g and using (53) for the leading term,
and for the remaining ones finding D^, etc., by forwards substitution. 

Footnote (on terminology): ‘Accuracy’ is sometimes called ‘precision’ in photogrammetry, but we have preferred to retain
the familiar meanings from numerical analysis: ‘precision’ means numerical error / number of 
working digits and ‘accuracy’ means statistical error / number of significant digits. 
precision by itself is not enough: the system must also be reliable in the face of
outliers, small modelling errors, and so forth. The key to reliability is the intelligent use of
redundancy: the results should represent an internally self-consistent consensus among 
many independent observations, no aspect of them should rely excessively on just a few 
observations. 

The photogrammetric literature on quality control deserves to be better known in vision, 
especially among researchers working on statistical issues. Förstner [33, 34] and Grün
[49, 50] give introductions with some sobering examples of the effects of poor design. 
See also [7,8,21,22]. All of these papers use least squares cost functions and scalar 
measurements. Our treatment generalizes this to allow robust cost functions and vector 
measurements, and is also slightly more self-consistent than the traditional approach. The 
techniques considered are useful for data analysis and reporting, and also to check whether 
design requirements are realistically attainable during project planning. Several properties 
should be verified. Internal reliability is the ability to detect and remove large aberrant 
observations using internal self-consistency checks. This is provided by traditional outlier 
detection and/or robust estimation procedures. External reliability is the extent to which 
any remaining undetected outliers can affect the estimates. Sensitivity analysis gives
useful criteria for the quality of a design. Finally, model selection tests attempt to decide 
which of several possible models is most appropriate and whether certain parameters can 
be eliminated. 



10.1 Cost Perturbations 

We start by analyzing the approximate effects of adding or deleting an observation, which
changes the cost function and hence the solution. We will use second order Taylor expansion
to characterize the effects of this. Let f_−(x) and f_+(x) ≡ f_−(x) + δf(x) be respectively
the total cost functions without and with the observation included, where δf(x) is the cost
contribution of the observation itself. Let g±, δg be the gradients and H±, δH the Hessians
of f±, δf. Let x_0 be the unknown true underlying state and x± be the minima of f±(x) (i.e.
the optimal state estimates with and without the observation included). Residuals at x_0
are the most meaningful quantities for outlier decisions, but x_0 is unknown so we will be
forced to use residuals at x± instead. Unfortunately, as we will see below, these are biased.
The bias is small for strong geometries but it can become large for weaker ones, so to
produce uniformly reliable statistical tests we will have to correct for it. The fundamental
result is: For any sufficiently well behaved cost function, the difference in fitted residuals
f_+(x_+) − f_−(x_−) is asymptotically an unbiased and accurate estimate of δf(x_0):

    \delta f(x_0) \approx f_+(x_+) - f_-(x_-) + \nu, \qquad \nu \sim O\!\left( \|\delta g\| / \sqrt{n_z - n_x} \right), \qquad \langle \nu \rangle \sim 0    (39)



Sketch proof: From the Newton steps δx± ≡ x± − x_0 ≈ −H±^{-1} g±(x_0) at x_0, we find
that f±(x±) − f±(x_0) ≈ −½ δx±^T H± δx± and hence ν ≡ f_+(x_+) − f_−(x_−) − δf(x_0) ≈
½ (δx_−^T H_− δx_− − δx_+^T H_+ δx_+). ν is unbiased to relatively high order: by the central limit
property of ML estimators, the asymptotic distributions of δx± are Gaussian N(0, H±^{-1}), so the
expectation of both δx±^T H± δx± is asymptotically the number of free model parameters n_x.
Expanding δx± and using g_+ = g_− + δg, the leading term is ν ≈ −δg(x_0)^T x_−, which
asymptotically has normal distribution ν ~ N(0, δg(x_0)^T H_−^{-1} δg(x_0)) with standard deviation
of order O(‖δg‖/√(n_z − n_x)), as x_− ~ N(0, H_−^{-1}) and ‖H_−‖ ~ O(n_z − n_x). □

Note that by combining values at two known evaluation points x±, we simulate a value at
a third unknown one x_0. The estimate is not perfect, but it is the best that we can do in the
circumstances.

There are usually many observations to test, so to avoid having to refit the model many
times we approximate the effects of adding or removing observations. Working at x± and
using the fact that g±(x±) = 0, the Newton step δx ≡ x_+ − x_− ≈ −H∓^{-1} δg(x±) implies
a change in fitted residual of:

    f_+(x_+) - f_-(x_-) \approx \delta f(x_\pm) \pm \tfrac{1}{2}\, \delta x^T H_\mp\, \delta x
                        = \delta f(x_\pm) \pm \tfrac{1}{2}\, \delta g(x_\pm)^T H_\mp^{-1}\, \delta g(x_\pm)    (40)

So (5f(x+) systematically underestimates f+(x+) — f_(x_) and hence <5f(Xo) by about 
i^X^ H_ 6x, and <5f(x_) overestimates it by about |^X^ H+ Sx. These biases are of order 
0(1/ {uz — nx)) and hence negligible when there is plenty of data, but they become large 
at low redundancies. Intuitively, including (5f improves the estimate on average, bringing 
about a ‘good’ reduction of 5f , but it also overfits i5f slightly, bringing about a further ‘bad’ 
reduction. Alternatively, the reduction in i5f on moving from X_ to X+ is bought at the cost 
of a slight increase in f_ (since X_ was already the minimum of f_), which should morally 
also be ‘charged’ to 5f. 
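For a purely quadratic cost the relation (40) holds exactly, which makes it easy to check numerically. The following is a minimal sketch, assuming a small linear least squares problem with unit weights; the helper `fit` and all variable names are illustrative and do not come from the text.

```python
import numpy as np

def fit(A, b):
    """Minimise f(x) = 0.5 * ||A x - b||^2; return the minimiser, the cost and the Hessian."""
    H = A.T @ A
    x = np.linalg.solve(H, A.T @ b)
    r = A @ x - b
    return x, 0.5 * r @ r, H

rng = np.random.default_rng(0)
n_obs, n_par = 12, 3
A_minus = rng.normal(size=(n_obs, n_par))   # observations already in the fit
b_minus = rng.normal(size=n_obs)
a_new = rng.normal(size=n_par)              # one extra scalar observation
b_new = 2.0

x_minus, f_minus, H_minus = fit(A_minus, b_minus)
x_plus, f_plus, H_plus = fit(np.vstack([A_minus, a_new]), np.append(b_minus, b_new))

def delta_f(x):
    """Cost contribution of the extra observation."""
    return 0.5 * (a_new @ x - b_new) ** 2

def delta_g(x):
    """Gradient of that contribution."""
    return a_new * (a_new @ x - b_new)

exact = f_plus - f_minus                    # f_+(x_+) - f_-(x_-), found by refitting

# Upper sign of (40): evaluate at x_+ and correct with H_-.
g = delta_g(x_plus)
from_plus = delta_f(x_plus) + 0.5 * g @ np.linalg.solve(H_minus, g)

# Lower sign of (40): evaluate at x_- and correct with H_+.
g = delta_g(x_minus)
from_minus = delta_f(x_minus) - 0.5 * g @ np.linalg.solve(H_plus, g)
```

Both corrected values agree with the refitted difference, while the uncorrected residuals `delta_f(x_plus)` and `delta_f(x_minus)` bracket it from below and above.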

When deleting observations, we will usually have already evaluated $H_+^{-1}$ (or a corresponding 
factorization of $H_+$) to find the Newton step near $x_+$, whereas (40) requires $H_-^{-1}$. 
And vice versa for addition. Provided that $\delta H \ll H$, it is usually sufficient to use $H_\pm^{-1}$ in 
place of $H_\mp^{-1}$ in the simple tests below. However if the observation couples to relatively few 
state variables, it is possible to calculate the relevant components of $H_\mp^{-1}$ fairly economically. 
If ‘*’ means ‘select the $k$ variables on which $\delta H, \delta g$ are non-zero’, then 
$\delta g^\top H^{-1}\, \delta g = (\delta g^*)^\top (H^{-1})^*\, \delta g^*$ and (see the footnote on the ‘*’ block formula) 

$$(H_\mp^{-1})^* = \left( \left( (H_\pm^{-1})^* \right)^{-1} \mp \delta H^* \right)^{-1} \approx (H_\pm^{-1})^* \pm (H_\pm^{-1})^*\, \delta H^*\, (H_\pm^{-1})^*.$$

Even without the approximation, this involves at most a $k \times k$ factorization or inverse. 
Indeed, for least squares $\delta H$ is usually of even lower rank (= the number of independent 
observations in $\delta f$), so the Woodbury formula (18) can be used to calculate the inverse even 
more efficiently. 
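The $k \times k$ block formula can be illustrated on a small dense example. This is a sketch under stated assumptions: the Hessian is random and positive definite, the observation is scalar with weight 0.3, and the index set `idx` of affected variables is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
M = rng.normal(size=(n + 4, n))
H_minus = M.T @ M                        # Hessian without the observation (positive definite)
idx = np.array([2, 5])                   # hypothetical: the k variables the observation touches

J_star = rng.normal(size=(1, k))         # Jacobian of one scalar observation on those variables
dH_star = 0.3 * J_star.T @ J_star        # dH* = J*^T W J*, with W = 0.3

H_plus = H_minus.copy()
H_plus[np.ix_(idx, idx)] += dH_star      # Hessian with the observation included

C_plus = np.linalg.inv(H_plus)[np.ix_(idx, idx)]    # (H_+^{-1})*, available after fitting
exact = np.linalg.inv(H_minus)[np.ix_(idx, idx)]    # (H_-^{-1})* by brute force

# Exact k x k update: (H_-^{-1})* = (((H_+^{-1})*)^{-1} - dH*)^{-1}
block = np.linalg.inv(np.linalg.inv(C_plus) - dH_star)

# First-order approximation: (H_+^{-1})* + (H_+^{-1})* dH* (H_+^{-1})*
first_order = C_plus + C_plus @ dH_star @ C_plus
```

Only $k \times k$ matrices are inverted in the update itself; the full inverse of `H_minus` is formed here purely as a reference value.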

10.2 Inner Reliability and Outlier Detection 

In robust cost models nothing special needs to be done with outliers: they are just 
normal measurements that happen to be downweighted owing to their large deviations. But 
in non-robust models such as least squares, explicit outlier detection and removal are 
essential for inner reliability. An effective diagnostic is to estimate $\delta f(x_0)$ using (39, 40), 
and significance-test it against its distribution under the null hypothesis that the observation 
is an inlier. For the least squares cost model, the null distribution of $2\, \delta f(x_0)$ is $\chi^2_k$ where 
$k$ is the number of independent observations contributing to $\delta f$. So if $\alpha$ is a suitable $\chi^2_k$ 
significance threshold, the typical one-sided significance test is: 

$$\alpha \;\le\; 2\, (f_+(x_+) - f_-(x_-)) \;\approx\; 2\, \delta f(x_\pm) \pm \delta g(x_\pm)^\top H_\mp^{-1}\, \delta g(x_\pm) \tag{41}$$

$$\approx\; \Delta z_i(x_\pm)^\top \left( W_i \pm W_i J_i H_\mp^{-1} J_i^\top W_i \right) \Delta z_i(x_\pm) \tag{42}$$

Footnote (on the ‘*’ block formula of §10.1): C.f. the lower right corner of (17), where the ‘*’ 
components correspond to block 2, so that $((H_\pm^{-1})^*)^{-1}$ is ‘$\bar D_2$’, the Schur complement of the 
remaining variables in $H_\pm$. Adding $\delta H^*$ changes the ‘D’ term but not the Schur complement correction. 

As usual we approximate $H_\mp^{-1} \approx H_\pm^{-1}$ and use $x_-$ results for additions and $x_+$ ones for 
deletions. These tests require the fitted covariance matrix $H_\pm^{-1}$ (or, if relatively few tests 
will be run, an equivalent factorization of $H_\pm$), but given this they are usually fairly 
economical owing to the sparseness of the observation gradients $\delta g(x_\pm)$. Equation (42) 
is for the nonlinear least squares model with residual error $\Delta z_i(x) \equiv \underline{z}_i - z_i(x)$, cost 
$\tfrac12\, \Delta z_i(x)^\top W_i\, \Delta z_i(x)$ and Jacobian $J_i = dz_i/dx$. Note that even though $z_i$ induces a change in 
all components of the observation residual $\Delta z$ via its influence on $\delta x$, only the immediately 
involved components $\Delta z_i$ are required in (42). The bias-correction-induced change of 
weight matrix $W_i \to W_i \pm W_i J_i H_\mp^{-1} J_i^\top W_i$ accounts for the others. For non-quadratic 
cost functions, the above framework still applies but the cost function’s native distribution 
of negative log likelihood values must be used instead of the Gaussian’s $\tfrac12 \chi^2$. 
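A deletion test of the form (42) can be sketched as follows for a linear least squares model, where the bias-corrected statistic reproduces the refitted cost difference exactly. The problem sizes, the noise level, the planted gross error and the function names are all assumptions made for illustration; the threshold is the 95% point of $\chi^2_2$.

```python
import numpy as np

def fit(J, W, z):
    """Weighted linear least squares for z_i ~ J_i x: returns x, Hessian, residuals, cost."""
    H = sum(Ji.T @ Wi @ Ji for Ji, Wi in zip(J, W))
    rhs = sum(Ji.T @ Wi @ zi for Ji, Wi, zi in zip(J, W, z))
    x = np.linalg.solve(H, rhs)
    dz = z - J @ x
    cost = 0.5 * sum(r @ Wi @ r for r, Wi in zip(dz, W))
    return x, H, dz, cost

def deletion_statistics(J, W, dz, H_plus):
    """Statistic (42), upper sign: residuals at x_+, corrected with H_- = H_+ - J_i^T W_i J_i."""
    stats = []
    for Ji, Wi, dzi in zip(J, W, dz):
        H_minus = H_plus - Ji.T @ Wi @ Ji
        corr = Wi @ Ji @ np.linalg.solve(H_minus, Ji.T) @ Wi
        stats.append(dzi @ (Wi + corr) @ dzi)
    return np.array(stats)

rng = np.random.default_rng(2)
m, d, n = 15, 2, 3                        # observations, observation dimension, parameters
J = rng.normal(size=(m, d, n))
W = np.tile(np.eye(d) / 0.1**2, (m, 1, 1))
x_true = np.array([1.0, -2.0, 0.5])
z = J @ x_true + 0.1 * rng.normal(size=(m, d))
z[4] += np.array([3.0, -3.0])             # planted gross error in observation 4

x_plus, H_plus, dz, f_plus = fit(J, W, z)
stats = deletion_statistics(J, W, dz, H_plus)

alpha = 5.991                             # 95% point of chi-squared with 2 degrees of freedom
suspect = int(np.argmax(stats))           # most suspicious observation
```

In a real adjustment one would reuse a factorization of `H_plus` and exploit the sparsity of `J[i]` rather than forming and solving with `H_minus` for every observation.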

In principle, the above analysis is only valid when at most one outlier causes a relatively 
small perturbation $\delta x$. In practice, the observations are repeatedly scanned for outliers, at 
each stage removing any discovered outliers (and perhaps reinstating previously discarded 
observations that have become inliers) and refitting. The net result is a form of M-estimator 
routine with an abruptly vanishing weight function: outlier deletion is just a roundabout 
way of simulating a robust cost function. (Hard inlier/outlier rules correspond to total 
likelihood functions that become strictly constant in the outlier region). 

The tests (41, 42) give what is needed for outlier decisions based on fitted state estimates 
$x_\pm$, but for planning purposes it is also useful to know how large a gross error must typically 
be w.r.t. the true state $x_0$ before it is detected. Outlier detection is based on the uncertain 
fitted state estimates, so we can only give an average case result. No adjustment for $x_\pm$ is 
needed in this case, so the average minimum detectable gross error is simply: 

$$\alpha \;\le\; 2\, \delta f(x_0) \;\approx\; \Delta z(x_0)^\top W\, \Delta z(x_0) \tag{43}$$



10.3 Outer Reliability 

Ideally, the state estimate should be as insensitive as possible to any remaining errors in 
the observations. To estimate how much a particular observation influences the final state 
estimate, we can directly monitor the displacement $\delta x \equiv x_+ - x_- \approx -H_\mp^{-1}\, \delta g(x_\pm)$. 
For example, we might define an importance weighting on the state parameters with a 
criterion matrix $U$ and monitor absolute displacements $\|U\, \delta x\| \approx \|U H_\mp^{-1}\, \delta g(x_\pm)\|$, or 
compare the displacement $\delta x$ to the covariance $H_\pm^{-1}$ of $x_\pm$ by monitoring 
$\delta x^\top H_\mp\, \delta x \approx \delta g(x_\pm)^\top H_\mp^{-1}\, \delta g(x_\pm)$. A bound on $\delta g(x_\pm)$ of the form 
$\delta g\, \delta g^\top \preceq V$ for some positive semidefinite $V$ (see the footnote below) implies a bound 
$\delta x\, \delta x^\top \preceq H_\mp^{-1} V H_\mp^{-1}$ on $\delta x$ and hence a bound 
$\|U\, \delta x\|^2 \le \mathcal N(U H_\mp^{-1} V H_\mp^{-1} U^\top)$, where $\mathcal N(\cdot)$ can be L2 norm, trace or Frobenius norm. For a robust 
cost model in which $\delta g$ is globally bounded, this already gives asymptotic bounds of 
order $O(\|H^{-1}\|\, \|\delta g\|) \sim O(\|\delta g\| / \sqrt{n_z - n_x})$ for the state perturbation, regardless of 
whether an outlier occurred. For non-robust cost models we have to use an inlier criterion 
to limit $\delta g$. For the least squares observation model with rejection test (42), 
$\Delta z\, \Delta z^\top \preceq \alpha\, (W_i \pm W_i J_i H_\mp^{-1} J_i^\top W_i)^{-1}$ and hence the maximum state perturbation due to a declared- 
inlying observation $z_i$ is: 

$$\delta x\, \delta x^\top \;\preceq\; \alpha\, H_\mp^{-1} J_i^\top W_i \left( W_i \pm W_i J_i H_\mp^{-1} J_i^\top W_i \right)^{-1} W_i J_i H_\mp^{-1} \;=\; \alpha \left( H_-^{-1} - H_+^{-1} \right) \tag{44}$$

$$\approx\; \alpha\, H_\pm^{-1} J_i^\top W_i J_i H_\pm^{-1} \tag{45}$$

so, e.g., $\delta x^\top H_\pm\, \delta x \le \alpha\, \mathrm{Trace}(J_i H_\pm^{-1} J_i^\top W_i)$ and 
$\|U\, \delta x\|^2 \le \alpha\, \mathrm{Trace}(J_i H_\pm^{-1} U^\top U H_\pm^{-1} J_i^\top W_i)$, where $W_i^{-1}$ is the nominal covariance of $z_i$. Note that these bounds 
are based on changes in the estimated state $x_\pm$. They do not directly control perturbations 
w.r.t. the true one $x_0$. The combined influence of several ($k \ll n_z - n_x$) observations is 
given by summing their $\delta g$’s. 

Footnote (on bounds of the form $\delta g\, \delta g^\top \preceq V$): This is a convenient intermediate form for 
deriving bounds. For positive semidefinite matrices $A, B$, we say that $B$ dominates $A$, $B \succeq A$, 
if $B - A$ is positive semidefinite. It follows that $\mathcal N(U A U^\top) \le \mathcal N(U B U^\top)$ for any matrix $U$ 
and any matrix function $\mathcal N(\cdot)$ that is non-decreasing under positive additions. Rotationally 
invariant non-decreasing functions $\mathcal N(\cdot)$ include all non-decreasing functions of the eigenvalues, 
e.g. L2 norm $\max \lambda_i$, trace $\sum \lambda_i$, Frobenius norm $\sqrt{\sum \lambda_i^2}$. For a vector $a$ and positive $B$, 
$a^\top B\, a \le k$ if and only if $a\, a^\top \preceq k\, B^{-1}$. (Proof: Conjugate by $B^{1/2}$ and then by a 
$(B^{1/2} a)$-reducing Householder rotation to reduce the question to the equivalence of 
$0 \preceq \mathrm{Diag}(k - u^2, k, \ldots, k)$ and $u^2 \le k$, where $u^2 = \|B^{1/2} a\|^2$.) Bounds of the form 
$\|U a\|^2 \le k\, \mathcal N(U B^{-1} U^\top)$ follow for any $U$ and any $\mathcal N(\cdot)$ for which $\mathcal N(v v^\top) = \|v\|^2$, 
e.g. L2 norm, trace, Frobenius norm. 
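The closed form $\alpha\,(H_-^{-1} - H_+^{-1})$ in (44) follows from the Woodbury formula in a few lines. The following worked derivation is a sketch for the upper sign (deletion, working at $x_+$), writing $\delta H \equiv J_i^\top W_i J_i$ so that $H_+ = H_- + \delta H$; the symbol $M$ is introduced here only for convenience.

```latex
\begin{align*}
M &\equiv W_i + W_i J_i H_-^{-1} J_i^\top W_i ,
\qquad
M^{-1} = W_i^{-1} - J_i H_+^{-1} J_i^\top
\quad\text{(Woodbury)} \\
H_-^{-1} J_i^\top W_i \, M^{-1} \, W_i J_i H_-^{-1}
  &= H_-^{-1} \left( \delta H - \delta H \, H_+^{-1} \, \delta H \right) H_-^{-1} \\
  &= H_-^{-1} \, \delta H \, H_+^{-1} \left( H_+ - \delta H \right) H_-^{-1} \\
  &= H_-^{-1} \left( H_+ - H_- \right) H_+^{-1}
   \;=\; H_-^{-1} - H_+^{-1}
\end{align*}
```

The lower sign works the same way with $M = W_i - W_i J_i H_+^{-1} J_i^\top W_i$ and $M^{-1} = W_i^{-1} + J_i H_-^{-1} J_i^\top$, and gives the same right hand side.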

10.4 Sensitivity Analysis 

This section gives some simple figures of merit that can be used to quantify network 
redundancy and hence reliability. First, in $\delta f(x_0) \approx \delta f(x_+) + \tfrac12\, \delta g(x_+)^\top H_-^{-1}\, \delta g(x_+)$, 
each cost contribution $\delta f(x_0)$ is split into two parts: the visible residual $\delta f(x_+)$ at the fitted 
state $x_+$; and $\tfrac12\, \delta x^\top H_-\, \delta x$, the change in the base cost $f_-(x)$ due to the state perturbation 
$\delta x = -H_-^{-1}\, \delta g(x_+)$ induced by the observation. Ideally, we would like the state perturbation 
to be small (for stability) and the residual to be large (for outlier detectability). In other 
words, we would like the following masking factor to be small ($m_i \ll 1$) for each 
observation: 

$$m_i \;\equiv\; \frac{\delta g(x_+)^\top H_-^{-1}\, \delta g(x_+)}{2\, \delta f(x_+) + \delta g(x_+)^\top H_-^{-1}\, \delta g(x_+)} \tag{46}$$

$$=\; \frac{\Delta z_i(x_+)^\top W_i J_i H_-^{-1} J_i^\top W_i\, \Delta z_i(x_+)}{\Delta z_i(x_+)^\top \left( W_i + W_i J_i H_-^{-1} J_i^\top W_i \right) \Delta z_i(x_+)} \tag{47}$$

(Here, $\delta f$ should be normalized to have minimum value 0 for an exact fit). If $m_i$ is known, 
the outlier test becomes $\delta f(x_+)/(1 - m_i) \ge \alpha$. The masking $m_i$ depends on the relative 
size of $\delta g$ and $\delta f$, which in general depends on the functional form of $\delta f$ and the specific 
deviation involved. For robust cost models, a bound on $\delta g$ may be enough to bound $m_i$ 
for outliers. However, for the least squares case ($\Delta z$ form), and more generally for quadratic 
cost models (such as robust models near the origin), $m_i$ depends only on the direction of 
$\Delta z_i$, not on its size, and we have a global L2 matrix norm based bound $m_i \le \frac{\nu}{1+\nu}$ where 
$\nu = \|L^\top J_i H_-^{-1} J_i^\top L\|_2 \le \mathrm{Trace}(J_i H_-^{-1} J_i^\top W)$ and $L L^\top = W_i$ is a Cholesky decomposition 
of $W_i$. (These bounds become equalities for scalar observations). 

The stability of the state estimate is determined by the total cost Hessian (information 
matrix) $H$. A large $H$ implies a small state estimate covariance $H^{-1}$ and also small responses 
$\delta x \approx -H^{-1}\, \delta g$ to cost perturbations $\delta g$. The sensitivity numbers $s_i \equiv \mathrm{Trace}(H_+^{-1}\, \delta H_i)$ 
are a useful measure of the relative amount of information contributed to $H_+$ by each 
observation. They sum to the model dimension ($\sum_i s_i = n_x$, because $\sum_i \delta H_i = H_+$), 
so they count “how many parameters worth” of the total information the observation 
contributes. Some authors prefer to quote redundancy numbers $r_i \equiv n_i - s_i$, where 
$n_i$ is the effective number of independent observations contained in $z_i$. The redundancy 
numbers sum to $n_z - n_x$, the total redundancy of the system. In the least squares case, 
$s_i = \mathrm{Trace}(J_i H_+^{-1} J_i^\top W)$ and $m_i = s_i$ for scalar observations, so the scalar outlier test 
becomes $\delta f(x_+)/r_i \ge \alpha$. Sensitivity numbers can also be defined for subgroups of the 
parameters in the form $\mathrm{Trace}(U H^{-1}\, \delta H)$, where $U$ is an orthogonal projection matrix that 
selects the parameters of interest. Ideally, the sensitivities of each subgroup should be 
spread evenly across the observations: a large $s_i$ indicates a heavily weighted observation, 
whose incorrectness might significantly compromise the estimate. 
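The sum rules for sensitivity and redundancy numbers are easy to confirm numerically. The sketch below assumes scalar observations ($n_i = 1$) with random Jacobian rows and weights; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_par = 10, 4
A = rng.normal(size=(n_obs, n_par))       # row i is the Jacobian J_i of scalar observation i
w = rng.uniform(0.5, 2.0, size=n_obs)     # scalar weights W_i

H_plus = A.T @ (w[:, None] * A)           # total Hessian: sum_i J_i^T W_i J_i
C = np.linalg.inv(H_plus)                 # fitted covariance H_+^{-1}

# s_i = Trace(H_+^{-1} dH_i) = W_i * J_i H_+^{-1} J_i^T for scalar observations
s = w * np.einsum("ij,jk,ik->i", A, C, A)
r = 1.0 - s                               # redundancy numbers r_i = n_i - s_i with n_i = 1

# Masking factor of observation 0 from the scalar bound nu / (1 + nu), using H_-.
H_minus0 = H_plus - w[0] * np.outer(A[0], A[0])
nu = w[0] * A[0] @ np.linalg.solve(H_minus0, A[0])
m0 = nu / (1.0 + nu)
```

Here `s.sum()` equals the number of parameters, `r.sum()` equals the total redundancy, and `m0` coincides with `s[0]`, matching the statement that $m_i = s_i$ for scalar observations.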



10.5 Model Selection 



It is often necessary to choose between several alternative models of the cameras or scene, 
e.g. additional parameters for lens distortion, camera calibrations that may or may not have 
changed between images, coplanarity or non-coplanarity of certain features. Over-special 
models give biased results, while over-general ones tend to be noisy and unstable. We 
will consider only nested models, for which a more general model is specialized to a 
more specific one by freezing some of its parameters at default values (e.g. zero skew or 
lens distortion, equal calibrations, zero deviation from a plane). Let: $x$ be the parameter 
vector of the more general model; $f(x)$ be its cost function; $c(x) = 0$ be the parameter 
freezing constraints enforcing the specialization; $k$ be the number of parameters frozen; $x_0$ 
be the true underlying state; $x_g$ be the optimal state estimate for the general model (i.e. the 
unconstrained minimum of $f(x)$); and $x_s$ be the optimal state estimate for the specialized 
one (i.e. the minimum of $f(x)$ subject to the constraints $c(x) = 0$). Then, under the 
null hypothesis that the specialized model is correct, $c(x_0) = 0$, and in the asymptotic 
limit in which $x_g - x_0$ and $x_s - x_0$ become Gaussian and the constraints become locally 
approximately linear across the width of this Gaussian, the difference in fitted residuals 
$2\, (f(x_s) - f(x_g))$ has a $\chi^2_k$ distribution. So if $2\, (f(x_s) - f(x_g))$ is less than some suitable 
$\chi^2_k$ decision threshold $\alpha$, we can accept the hypothesis that the additional parameters take 
their default values, and use the specialized model rather than the more general one. 

As before, we can avoid fitting one of the models by using a linearized analysis. First 
suppose that we start with a fit of the more general model $x_g$. Let the linearized constraints 
at $x_g$ be $c(x_g + \delta x) \approx c(x_g) + C\, \delta x$, where $C \equiv dc/dx$. A straightforward Lagrange 
multiplier calculation gives: 

$$2\, (f(x_s) - f(x_g)) \;\approx\; c(x_g)^\top (C H^{-1} C^\top)^{-1} c(x_g), \qquad x_s \;\approx\; x_g - H^{-1} C^\top (C H^{-1} C^\top)^{-1} c(x_g) \tag{48}$$



Conversely, starting from a fit of the more specialized model, the unconstrained minimum is 
given by the Newton step: $x_g \approx x_s - H^{-1} g(x_s)$, and $2\, (f(x_s) - f(x_g)) \approx g(x_s)^\top H^{-1} g(x_s)$, 
where $g(x_s)$ is the residual cost gradient at $x_s$. This requires the general-model covariance 
$H^{-1}$ (or an equivalent factorization of $H$), which may not have been worked out. Suppose 
that the additional parameters were simply appended to the model, $x \to (x, y)$ where $x$ is 
now the reduced parameter vector of the specialized model and $y$ contains the additional 
parameters. Let the general-model cost gradient at $(x_s, y_s)$ be $\begin{pmatrix} 0 \\ h \end{pmatrix}$ where $h = df/dy$, and 
its Hessian be $\begin{pmatrix} H & A \\ A^\top & B \end{pmatrix}$. A straightforward calculation shows that: 

$$2\, (f(x_s, y_s) - f(x_g, y_g)) \;\approx\; h^\top (B - A^\top H^{-1} A)^{-1} h$$

$$\begin{pmatrix} x_g \\ y_g \end{pmatrix} \;\approx\; \begin{pmatrix} x_s \\ y_s \end{pmatrix} + \begin{pmatrix} H^{-1} A \\ -I \end{pmatrix} (B - A^\top H^{-1} A)^{-1} h$$

Given $H^{-1}$ or an equivalent factorization of $H$, these tests are relatively inexpensive for 
small $k$. They amount respectively to one step of Sequential Quadratic Programming and 
one Newton step, so the results will only be accurate when these methods converge rapidly. 

Footnote (on the $\chi^2_k$ distribution of the difference in fitted residuals): This happens irrespective 
of the observation distributions because, unlike the case of adding an observation, the same 
observations and cost function are used for both fits. 

Footnote (on preferring the specialized model): In practice, small models are preferable as they 
have greater stability and predictive power and less computational cost. So the threshold $\alpha$ is 
usually chosen to be comparatively large, to ensure that the more general model will not be 
chosen unless there is strong evidence for it. 
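Both linearized tests are exact when the cost is quadratic and the constraints are linear, so they can be checked against explicit refits. The sketch below assumes a linear least squares cost with design matrix `D`, $k = 2$ appended parameters frozen at zero in the specialized model, and constraint Jacobian $C = (0 \;\; I)$; these choices and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, n_x, k = 20, 3, 2
D = rng.normal(size=(n_obs, n_x + k))     # design matrix of the general model
b = rng.normal(size=n_obs)

def cost(p):
    r = D @ p - b
    return 0.5 * r @ r

H_full = D.T @ D
p_g = np.linalg.solve(H_full, D.T @ b)    # general fit (x_g, y_g)
Dx = D[:, :n_x]
x_s = np.linalg.solve(Dx.T @ Dx, Dx.T @ b)
p_s = np.concatenate([x_s, np.zeros(k)])  # specialized fit (x_s, y_s = 0)
exact = 2 * (cost(p_s) - cost(p_g))       # reference value from the two refits

# Form (48): start from the general fit, constraints c(p) = y with C = [0 I].
C = np.hstack([np.zeros((k, n_x)), np.eye(k)])
c = C @ p_g
CHC = C @ np.linalg.solve(H_full, C.T)
from_general = c @ np.linalg.solve(CHC, c)
p_s_pred = p_g - np.linalg.solve(H_full, C.T) @ np.linalg.solve(CHC, c)

# Appended-parameter form: start from the specialized fit.
H, A, B = H_full[:n_x, :n_x], H_full[:n_x, n_x:], H_full[n_x:, n_x:]
h = (D.T @ (D @ p_s - b))[n_x:]           # gradient w.r.t. y at (x_s, 0)
S = B - A.T @ np.linalg.solve(H, A)       # Schur complement B - A^T H^{-1} A
from_special = h @ np.linalg.solve(S, h)
p_g_pred = p_s + np.vstack([np.linalg.solve(H, A), -np.eye(k)]) @ np.linalg.solve(S, h)
```

For nonlinear models the same expressions give only the one-step approximations described above, so their accuracy depends on how close the two fits are.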

Another, softer, way to handle nested models is to apply a prior $\delta f_{\mathrm{prior}}(x)$ peaked at the 
zero of the specialization constraints c(x). If this is weak the data will override it when 
necessary, but the constraints may not be very accurately enforced. If it is stronger, we 
can either apply an ‘outlier’ test (39,41) to remove it if it appears to be incorrect, or use 
a sticky prior — a prior similar to a robust distribution, with a concentrated central peak 
and wide flat tails, that will hold the estimate near the constraint surface for weak data, but 
allow it to ‘unstick’ if the data becomes stronger. 

Finally, more heuristic rules are often used for model selection in photogrammetry, 
for example deleting any additional parameters that are excessively correlated (correlation 
coefficient greater than ~ 0.9) with other parameters, or whose introduction appears to 
cause an excessive increase in the covariance of other parameters [49, 50]. 

11 Network Design 

Network design is the problem of planning camera placements and numbers of images 
before a measurement project, to ensure that sufficiently accurate and reliable estimates 
of everything that needs to be measured are found. We will not say much about design, 
merely outlining the basic considerations and giving a few useful rules of thumb. See [5, 
chapter 6], [79, 78], [73, Vol.2 §4] for more information. 

Factors to be considered in network design include: scene coverage, occlusion / visibility 
and feature viewing angle; field of view, depth of field, resolution and workspace 
constraints; and geometric strength, accuracy and redundancy. The basic quantitative aids 
to design are covariance estimation in a suitably chosen gauge (see §9) and the quality 
control tests from §10. Expert systems have been developed [79], but in practice most 
designs are still based on personal experience and rules of thumb. 

In general, geometric stability is best for ‘convergent’ (close-in, wide baseline, high 
perspective) geometries, using wide angle lenses to cover as much of the object as possible, 
and large film or CCD formats to maximize measurement precision. The wide coverage 
maximizes the overlap between different sub-networks and hence overall network rigidity, 
while the wide baselines maximize the sub-network stabilities. The practical limitations 
on closeness are workspace, field of view, depth of field, resolution and feature viewing 
angle constraints. 

Maximizing the overlap between sub-networks is very important. For objects with 
several faces such as buildings, images should be taken from corner positions to tie the 
face sub-networks together. For large projects, large scale overview images can be used to 
tie together close-in densifying ones. When covering individual faces or surfaces, overlap 
and hence stability are improved by taking images with a range of viewing angles rather than 
strictly fronto-parallel ones (e.g., for the same number of images, pan-move-pan-move or 
interleaved left-looking and right-looking images are stabler than a simple fronto-parallel 
track). Similarly, for buildings or turntable sequences, using a mixture of low and high 
viewpoints helps stability. 

For reliability, one usually plans to see each feature point in at least four images. 
Although two images in principle suffice for reconstruction, they offer little redundancy 
and no resistance against feature extraction failures. Even with three images, the internal 
reliability is still poor: isolated outliers can usually be detected, but it may be difficult to 
say which of the three images they occurred in. Moreover, 3+ image geometries with 
widely spaced (i.e. non-aligned) centres usually give much more isotropic feature error 
distributions than two image ones. 

If the bundle adjustment will include self-calibration, it is important to include a range 
of viewing angles. For example for a flat, compact object, views might be taken at regularly 
spaced points along a 30–45° half-angle cone centred on the object, with 90° optical axis 
rotations between views. 

12 Summary and Recommendations 

This survey was written in the hope of making photogrammetric know-how about bundle 
adjustment (the simultaneous optimization of structure and camera parameters in visual 
reconstruction) more accessible to potential implementors in the computer vision 
community. Perhaps the main lessons are the extraordinary versatility of adjustment methods, 
the critical importance of exploiting the problem structure, and the continued dominance 
of second order (Newton) algorithms, in spite of all efforts to make the simpler first order 
methods converge more rapidly. 

We will finish by giving a series of recommendations for methods. At present, these 
must be regarded as very provisional, and subject to revision after further testing. 

Parametrization: (§2.2, 4.5) During step prediction, avoid parameter singularities, infinities, 
strong nonlinearities and ill-conditioning. Use well-conditioned local (current value 
+ offset) parametrizations of nonlinear elements when necessary to achieve this : the local 
step prediction parametrization can be different from the global state representation one. 
The ideal is to make the parameter space error function as isotropic and as near-quadratic 
as possible. Residual rotation or quaternion parametrizations are advisable for rotations, 
and projective homogeneous parametrizations for distant points, lines and planes (i.e. 3D 
features near the singularity of their affine parametrizations, affine infinity). 
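As an illustration of a local (current value + offset) parametrization, the sketch below updates a rotation matrix by a small rotation vector through the Rodrigues formula. It assumes a right-multiplied update $R \leftarrow R \exp([\omega]_\times)$; the function names are illustrative, and the choice between left and right multiplication is a convention not fixed by the text.

```python
import numpy as np

def skew(w):
    """Cross-product matrix [w]x of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Rodrigues formula: rotation matrix for the rotation vector w."""
    theta = np.linalg.norm(w)
    K = skew(w)
    if theta < 1e-8:                       # series expansion near the identity
        return np.eye(3) + K + 0.5 * K @ K
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * K @ K)

def update_rotation(R, w):
    """Local update: the step is predicted in the 3-vector w, the state stays a matrix."""
    return R @ exp_so3(w)

R = exp_so3(np.array([0.3, -0.2, 0.9]))    # current rotation estimate
w = np.array([0.01, 0.02, -0.03])          # small offset predicted by the optimizer
R_new = update_rotation(R, w)
```

The offset `w` is always small and centred on zero, so the step prediction never sees the singularities of a global three-parameter rotation representation, while `R_new` remains an exact rotation.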

Cost function: (§3) The cost should be a realistic approximation to the negative log 
likelihood of the total (inlier + outlier) error distribution. The exact functional form of the 
distribution is not too critical, however: (i) Undue weight should not be given to outliers 
by making the tails of the distribution (the predicted probability of outliers) unrealistically 
small. (NB: Compared to most real-world measurement distributions, the tails of a Gaussian 
are unrealistically small.) (ii) The dispersion matrix or inlier covariance should be a realistic 
estimate of the actual inlier measurement dispersion, so that the transition between inliers 
and outliers is in about the right place, and the inlier errors are correctly weighted during 
fitting. 
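One simple way to build such a total (inlier + outlier) cost is to take the negative log likelihood of a mixture. The sketch below assumes a Gaussian inlier distribution plus a uniform outlier background; the mixing fraction and background width are hypothetical tuning values, not recommendations from the text.

```python
import numpy as np

def gaussian_cost(r, sigma=1.0):
    """Negative log likelihood of a pure Gaussian error model."""
    return 0.5 * (r / sigma) ** 2 + np.log(np.sqrt(2.0 * np.pi) * sigma)

def robust_cost(r, sigma=1.0, outlier_frac=0.05, outlier_width=100.0):
    """Negative log likelihood of a Gaussian inlier + uniform outlier mixture."""
    inlier = ((1.0 - outlier_frac)
              * np.exp(-0.5 * (r / sigma) ** 2)
              / (np.sqrt(2.0 * np.pi) * sigma))
    outlier = outlier_frac / outlier_width  # flat background density
    return -np.log(inlier + outlier)

near = robust_cost(0.5) - robust_cost(0.0)   # close to the Gaussian increase of 0.125
far = robust_cost(1000.0)                    # saturates at -log(outlier_frac / outlier_width)
```

Near the origin the mixture cost is almost quadratic, so inliers are weighted as in least squares, while for large residuals it flattens out so that a gross error contributes a bounded amount to the total cost.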

Optimization method: (§4, 6, 7) For batch problems use a second order Gauss-Newton 
method with sparse factorization (see below) of the Hessian, unless : 

• The problem is so large that exact sparse factorization is impractical. In this case consider 
either iterative linear system solvers such as Conjugate Gradient for the Newton step, 
or related nonlinear iterations such as Conjugate Gradient, or preferably Limited Mem- 
ory Quasi-Newton or (if memory permits) full Quasi-Newton (§7, [29,93,42]). (None 
of these methods require the Hessian). If you are in this case, it would pay to investi- 
gate professional large-scale optimization codes such as MINPACK-2, LANCELOT, or 
commercial methods from NAG or IMSL (see §C.2). 

• If the problem is medium or large but dense (which is unusual), and if it has strong 
geometry, alternation of resection and intersection may be preferable to a second order 
method. However, in this case Successive Over-Relaxation (SOR) would be even better, 
and Conjugate Gradient is likely to be better yet. 

• In all of the above cases, good preconditioning is critical (§7.3). 

For on-line problems (rather than batch ones), use factorization updating rather than 
matrix inverse updating or re-factorization (§B.5). In time-series problems, investigate the 
effect of changing the time window (§8.2, [83, 84]), and remember that Kalman filtering 
is only the first half-iteration of a full nonlinear method. 

Factorization method: (§6.2, B.1) For speed, preserve the symmetry of the Hessian during 
factorization by using: Cholesky decomposition for positive definite Hessians (e.g. 
unconstrained problems in a trivial gauge); pivoted Cholesky decomposition for positive 
semi-definite Hessians (e.g. unconstrained problems with gauge fixing by subset selection, 
§9.5); and Bunch-Kaufman decomposition (§B.1) for indefinite Hessians (e.g. the 
augmented Hessians of constrained problems, §4.4). Gaussian elimination is stable but a 
factor of two slower than these. 
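The combination of a Gauss-Newton iteration with a Cholesky solve can be sketched on a toy problem. This example assumes a dense two-parameter exponential fit with noise-free data and no step control; a real bundle adjuster would use a sparse factorization, triangular solves and some form of damping, and the function names here are illustrative.

```python
import numpy as np

def cholesky_solve(H, g):
    """Solve H dx = -g for symmetric positive definite H via H = L L^T."""
    L = np.linalg.cholesky(H)
    y = np.linalg.solve(L, -g)             # forward substitution (generic solver for brevity)
    return np.linalg.solve(L.T, y)         # back substitution

def gauss_newton(residual, jacobian, x, iters=20):
    """Undamped Gauss-Newton: H is approximated by J^T J and the gradient is J^T r."""
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        x = x + cholesky_solve(J.T @ J, J.T @ r)
    return x

# Toy model y = a * exp(b * t) with true parameters a = 2, b = -1.5.
t = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-1.5 * t)

def residual(x):
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x_fit = gauss_newton(residual, jacobian, np.array([1.0, -1.0]))
```

Because the Gauss-Newton Hessian approximation is symmetric positive definite whenever the Jacobian has full column rank, the Cholesky factorization applies directly and only one triangle of the matrix needs to be formed and stored.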

Variable ordering: (§6.3) The variables can usually be ordered by hand for regular net- 
works, but for more irregular ones (e.g. close range site-modelling) some experimentation 
may be needed to find the most efficient overall ordering method. If reasonably compact 
profiles can be found, profile representations (§6.3, B.3) are simpler to implement and 
faster than general sparse ones (§6.3). 

• For dense networks use a profile representation and a “natural” variable ordering: either 
features then cameras, or cameras then features, with whichever has the fewest param- 
eters last. An explicit reduced system based implementation such as Brown’s method 
[19] can also be used in this case (§6.1, A). 

• If the problem has some sort of 1D temporal or spatial structure (e.g. image streams, 
turntable problems), try a profile representation with natural (simple connectivity) or 
Snay’s banker’s (more complex connectivity) orderings (§6.3, [101,24]). A recursive 
on-line updating method might also be useful in this case. 

• If the problem has 2D structure (e.g. cartography and other surface coverage problems) 
try nested dissection, with hand ordering for regular problems (cartographic blocks), 
and a multilevel scheme for more complex ones (§6.3). A profile representation may or 
may not be suitable. 

• For less regular sparse networks, the choice is not clear. Try minimum degree ordering 
with a general sparse representation, Snay’s Banker’s with a profile representation, or 
multilevel nested dissection. 

For all of the automatic variable ordering methods, try to order any especially highly 
connected variables last by hand, before invoking the method. 

Gauge fixing: (§9) For efficiency, use either a trivial gauge or a subset selection method as 
a working gauge for calculations, and project the results into whatever gauge you want later 
by applying a suitable gauge projector Pg (32). Unless you have a strong reason to use an 
external reference system, the output gauge should probably be an inner gauge centred on 
the network elements you care most about, i.e. the observed features for a reconstruction 
problem, and the cameras for a navigation one. 

Quality control and network design: (§10) A robust cost function helps, but for overall 
system reliability you still need to plan your measurements in advance (until you have 
developed a good intuition for this), and check the results afterwards for outlier sensitivity 
and over-modelling, using a suitable quality control procedure. Do not underestimate the 
extent to which either low redundancy, or weak geometry, or over-general models can 
make gross errors undetectable. 



A Historical Overview 

This appendix gives a brief history of the main developments in bundle adjustment, in- 
cluding literature references. 

Least squares: The theory of combining measurements by minimizing the sum of their 
squared residuals was developed independently by Gauss and Legendre around 1795-1820 
[37, 74], [36, Vol.IV, 1-93], about 40 years after robust L1 estimation [15]. Least squares 
was motivated by estimation problems in astronomy and geodesy and extensively applied 
to both fields by Gauss, whose remarkable 1823 monograph [37, 36] already contains 
almost the complete modern theory of least squares including elements of the theory of 
probability distributions, the definition and properties of the Gaussian distribution, and a 
discussion of bias and the “Gauss-Markov” theorem, which states that least squares gives 
the Best Linear Unbiased Estimator (BLUE) [37, 11]. It also introduces the $LDL^\top$ form of 
symmetric Gaussian elimination and the Gauss-Newton iteration for nonlinear problems, 
essentially in their modern forms although without explicitly using matrices. The 1828 
supplement on geodesy introduced the Gauss-Seidel iteration for solving large nonlinear 
systems. The economic and military importance of surveying led to extensive use of least 
squares and several further developments: Helmert’s nested dissection [64] (probably the 
first systematic sparse matrix method) in the 1880’s, Cholesky decomposition around 
1915, Baarda’s theory of reliability of measurement networks in the 1960’s [7, 8], and 
Meissl [87, 89] and Baarda’s [6] theories of uncertain coordinate frames and free networks 
[22, 25]. We will return to these topics below. 

Second order bundle algorithms: Electronic computers capable of solving reasonably 
large least squares problems first became available in the late 1950’s. The basic 
photogrammetric bundle method was developed for the U.S. Air Force by Duane C. Brown and his 
co-workers in 1957-9 [16, 19]. The initial focus was aerial cartography, but by the late 
1960’s bundle methods were also being used for close-range measurements (see the footnote 
on close range below). The links with geodesic least squares and the possibility of combining 
geodesic and other types of measurements with the photogrammetric ones were clear right 
from the start. Initially the cameras were assumed to be calibrated (see the footnote on 
calibration below), so the optimization was over object points 
and camera poses only. Self calibration (the estimation of internal camera parameters 
during bundle adjustment) was first discussed around 1964 and implemented by 1968 
[19]. Camera models were greatly refined in the early 1970’s, with the investigation of 
many alternative sets of additional (distortion) parameters [17-19]. Even with stable 
and carefully calibrated aerial photogrammetry cameras, self calibration significantly 
improved accuracies (by factors of around 2-10). This led to rapid improvements in camera 
design as previously unmeasurable defects like film platen non-flatness were found and 
corrected. Much of this development was led by Brown and his collaborators. See [19] 
for more of the history and references. 

Footnote (on close range): Close range means essentially that the object has significant depth 
relative to the camera distance, i.e. that there is significant perspective distortion. For aerial 
images the scene is usually shallow compared to the viewing height, so focal length variations 
are very difficult to disentangle from depth variations. 

Fig. 9. A schematic history of bundle adjustment. 

Brown’s initial 1958 bundle method [16, 19] uses block matrix techniques to elimi- 
nate the structure parameters from the normal equations, leaving only the camera pose 
parameters. The resulting reduced camera subsystem is then solved by dense Gaussian 
elimination, and back-substitution gives the structure. For self-calibration, a second reduc- 
tion from pose to calibration parameters can be added in the same way. Brown’s method is 
probably what most vision researchers think of as ‘bundle adjustment’, following 
descriptions by Slama [100] and Hartley [58, 59]. It is still a reasonable choice for small dense 
networks (see the footnote on dense and sparse networks below), but it rapidly becomes 
inefficient for the large sparse ones that arise in aerial cartography and large-scale site modelling. 
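The reduction described above is a Schur complement on the normal equations. The following sketch assumes a small synthetic dense network with random Jacobian blocks (6 parameters per camera, 3 per point, 2 rows per image observation); it illustrates only the block algebra and is not a reconstruction of Brown's original implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n_cam, n_pt, cam_dim, pt_dim = 4, 10, 6, 3
nc = n_cam * cam_dim

# Hypothetical dense network: every point is seen in every image.
rows = 2 * n_cam * n_pt
J = np.zeros((rows, nc + n_pt * pt_dim))
row = 0
for j in range(n_cam):
    for i in range(n_pt):
        J[row:row + 2, j * cam_dim:(j + 1) * cam_dim] = rng.normal(size=(2, cam_dim))
        p0 = nc + i * pt_dim
        J[row:row + 2, p0:p0 + pt_dim] = rng.normal(size=(2, pt_dim))
        row += 2
r = rng.normal(size=rows)

H, g = J.T @ J, J.T @ r                   # normal equations H dx = -g
U, W, V = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
g_c, g_p = g[:nc], g[nc:]

# V is block diagonal (one 3x3 block per point), so it is inverted block by block.
V_inv = np.zeros_like(V)
for i in range(n_pt):
    s = slice(i * pt_dim, (i + 1) * pt_dim)
    V_inv[s, s] = np.linalg.inv(V[s, s])

S = U - W @ V_inv @ W.T                   # reduced camera system
dc = np.linalg.solve(S, -(g_c - W @ V_inv @ g_p))   # camera update
dp = V_inv @ (-g_p - W.T @ dc)            # back-substitution gives the structure update
```

The only dense solve is on the `nc`-dimensional reduced system, which is why the method suits small dense networks but loses its advantage once the reduced camera system itself becomes large and sparse.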

For larger problems, more of the natural sparsity has to be exploited. In aerial 
cartography, the regular structure makes this relatively straightforward. The images are arranged 
in blocks: rectangular or irregular grids designed for uniform ground coverage, formed 
from parallel 1D strips of images with about 50-70% forward overlap giving adjacent 
stereo pairs or triplets, about 10-20% side overlap, and a few known ground control points 
sprinkled sparsely throughout the block. Features are shared only between neighbouring 
images, and images couple in the reduced camera subsystem only if they share common 
features. So if the images are arranged in strip or cross-strip ordering, the reduced 
camera system has a triply-banded block structure (the upper and lower bands representing, 
e.g., right and left neighbours, and the central band forward and backward ones). Several 
efficient numerical schemes exist for such matrices. The first was Gyer & Brown’s 1967 
recursive partitioning method [57, 19], which is closely related to Helmert’s 1880 geodesic 
method [64]. (Generalizations of these have become one of the major families of modern 
sparse matrix methods [40, 26, 11]). The basic idea is to split the rectangle into two halves, 
recursively solving each half and gluing the two solutions together along their common 
boundary. Algebraically, the variables are reordered into left-half-only, right-half-only and 
boundary variables, with the latter (representing the only coupling between the two halves) 
eliminated last. The technique is extremely effective for aerial blocks and similar problems 
where small separating sets of variables can be found. Brown mentions adjusting a block 
of 162 photos on a machine with only 8k words of memory, and 1000 photo blocks were 
already feasible by mid-1967 [19]. For less regular networks such as site modelling ones 
it may not be feasible to choose an appropriate variable ordering beforehand, but efficient 
on-line ordering methods exist [40, 26, 11] (see §6.3). 

Footnote (on calibration): Calibration always denotes internal camera parameters (“interior 
orientation”) in photogrammetric terminology. External calibration is called pose or (exterior) 
orientation. 

Footnote (on dense and sparse networks): A photogrammetric network is dense if most of the 3D 
features are visible in most of the images, and sparse if most features appear in only a few 
images. This corresponds directly to the density or sparsity of the off-diagonal block 
(feature-camera coupling matrix) of the bundle Hessian. 

Independent model methods: These approximate bundle adjustment by calculating a 
number of partial reconstructions independently and merging them by pairwise 3D align- 
ment. Even when the individual models and alignments are separately optimal, the result 
is suboptimal because the stresses produced by alignment are not propagated back
into the individual models. (Doing so would amount to completing one full iteration of 
an optimal recursive decomposition style bundle method — see §8.2). Independent model 
methods were at one time the standard in aerial photogrammetry [95, 2, 100, 73], where
they were used to merge individual stereo pair reconstructions within aerial strips into 
a global reconstruction of the whole block. They are always less accurate than bundle 
methods, although in some cases the accuracy can be comparable. 

First order & approximate bundle algorithms : Another recurrent theme is the use of 
approximations or iterative methods to avoid solving the full Newton update equations. 
Most of the plausible approximations have been rediscovered several times, especially 
variants of alternate steps of resection (finding the camera poses from known 3D points) and
intersection (finding the 3D points from known camera poses), and the linearized version of 
this, the block Gauss-Seidel iteration. Brown’s group had already experimented with Block 
Successive Over-Relaxation (BSOR — an accelerated variant of Gauss-Seidel) by 1964 
[19], before they developed their recursive decomposition method. Both Gauss-Seidel and 
BSOR were also applied to the independent model problem around this time [95, 2]. These 
methods are mainly of historical interest. For large sparse problems such as aerial blocks, 
they cannot compete with efficiently organized second order methods. Because some
of the inter-variable couplings are ignored, corrections propagate very slowly across the 
network (typically one step per iteration), and many iterations are required for convergence 
(see §7). 
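
The slow propagation of corrections can be seen on a toy problem. The sketch below is only an illustration under simple assumptions: a scalar chain-coupled system stands in for a strip of images, and plain (not block) Gauss-Seidel is used; the function names are hypothetical.

```python
def chain_matrix(n):
    """Tridiagonal matrix coupling each variable only to its two neighbours."""
    return [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


def gauss_seidel_sweep(A, b, x):
    """One in-place Gauss-Seidel sweep for A x = b, visiting variables in index order."""
    n = len(b)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def sweeps_to_converge(A, b, tol=1e-8, max_sweeps=100000):
    """Count the sweeps needed to bring the largest residual below tol."""
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_sweeps + 1):
        gauss_seidel_sweep(A, b, x)
        r = max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))
        if r < tol:
            return sweep, x
    return max_sweeps, x


# A disturbance at the far end of the chain moves back one variable per sweep.
A = chain_matrix(10)
b = [0.0] * 9 + [1.0]
x = [0.0] * 10
for _ in range(3):
    gauss_seidel_sweep(A, b, x)
touched = [i for i, v in enumerate(x) if v != 0.0]   # indices reached after 3 sweeps
sweeps, _ = sweeps_to_converge(A, b)
```

A direct factorization of the same system needs a single solve, whereas the sweep count grows with the length of the chain.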

Quality control : In parallel with this algorithmic development, two important theoretical 
developments took place. Firstly, the Dutch geodesist W. Baarda led a long-running work- 
ing group that formulated a theory of statistical reliability for least squares estimation [7, 
8]. This greatly clarified the conditions (essentially redundancy) needed to ensure that 
outliers could be detected from their residuals (inner reliability), and that any remaining
undetected outliers had only a limited effect on the final results (outer reliability). A. Grün
[49, 50] and W. Förstner [30, 33, 34] adapted this theory to photogrammetry around 1980,
and also gave some early correlation and covariance based model selection heuristics
designed to control over-fitting problems caused by over-elaborate camera models in self
calibration.



Datum / gauge freedom: Secondly, as problem size and sophistication increased, it 
became increasingly difficult to establish sufficiently accurate control points for large 
geodesic and photogrammetric networks. Traditionally, the network had been viewed as 
a means of ‘densifying’ a fixed control coordinate system — propagating control-system 
coordinates from a few known control points to many unknown ones. But this viewpoint 
is suboptimal when the network is intrinsically more accurate than the control, because 
most of the apparent uncertainty is simply due to the uncertain definition of the control 
coordinate system itself. In the early 1960’s, Meissl studied this problem and developed the 
first free network approach, in which the reference coordinate system floated freely rather 
than being locked to any given control points [87, 89]. More precisely, the coordinates 
are pinned to a sort of average structure defined by so-called inner constraints. Owing 
to the removal of control-related uncertainties, the nominal structure covariances become 
smaller and easier to interpret, and the numerical bundle iteration also converges more 
rapidly. Later, Baarda introduced another approach to this theory based on S-transforms 
— coordinate transforms between uncertain frames [6, 21, 22, 25]. 



Least squares matching: All of the above developments originally used manually ex- 
tracted image points. Automated image processing was clearly desirable, but it only grad- 
ually became feasible owing to the sheer size and detail of photogrammetric images. Both 
feature based, e.g. [31, 32], and direct (region based) [1, 52, 55, 110] methods were studied,
the latter especially for matching low-contrast natural terrain in cartographic applications. 
Both rely on some form of least squares matching (as image correlation is called in pho- 
togrammetry). Correlation based matching techniques remain the most accurate methods 
of extracting precise translations from images, both for high contrast photogrammetric 
targets and for low contrast natural terrain. Starting from around 1985, Grün and his
co-workers combined region based least squares matching with various geometric constraints.
Multi-photo geometrically constrained matching optimizes the match over multiple
images simultaneously, subject to the inter-image matching geometry [52, 55, 9]. For each
surface patch there is a single search over patch depth and possibly slant, which simulta- 
neously moves it along epipolar lines in the other images. Initial versions assumed known 
camera matrices, but a full patch-based bundle method was later investigated [9]. Related 
methods in computer vision include [94, 98, 67]. Globally enforced least squares match- 
ing [53, 97, 76] further stabilizes the solution in low-signal regions by enforcing continuity 
constraints between adjacent patches. Patches are arranged in a grid and matched using 
local affine or projective deformations, with additional terms to penalize mismatching at 
patch boundaries. Related work in vision includes [104, 102]. The inter-patch constraints 
give a sparsely-coupled structure to the least squares matching equations, which can again 
be handled efficiently by recursive decomposition.

B Matrix Factorization

This appendix covers some standard material on matrix factorization, including the tech- 
nical details of factorization, factorization updating, and covariance calculation methods. 
See [44, 11] for more details. 

Terminology: Depending on the factorization, ‘L’ stands for lower triangular, ‘U’ or ‘R’ 
for upper triangular, ‘D’ or ‘S’ for diagonal, ‘Q’ or ‘U’,‘V’ for orthogonal factors. 



B.1 Triangular Decompositions

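Any matrix A has a family of block (lower triangular)*(diagonal)*(upper triangular)
factorizations A = L D U (50), with the blocks given by the recursion

$$\bar{A}_{ij} \equiv A_{ij} - \sum_{k<\min(i,j)} L_{ik}\, D_k\, U_{kj}, \qquad
L_{ii}\, D_i\, U_{ii} = \bar{A}_{ii}, \qquad
L_{ij} = \bar{A}_{ij}\, U_{jj}^{-1} D_j^{-1} \;\; (i>j), \qquad
U_{ij} = D_i^{-1} L_{ii}^{-1}\, \bar{A}_{ij} \;\; (i<j) \qquad (51)$$

Here, the diagonal blocks $D_1 \ldots D_{r-1}$ must be chosen to be square and invertible, and r
is determined by the rank of A. The recursion (51) follows immediately from the product
$A_{ij} = (L\,D\,U)_{ij} = \sum_{k \le \min(i,j)} L_{ik}\, D_k\, U_{kj}$. Given such a factorization, linear equations
can be solved by forwards and backwards substitution as in (22-24).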

The diagonal blocks of L, D, U can be chosen freely subject to $L_{ii} D_i U_{ii} = \bar{A}_{ii}$
(where $\bar{A}_{ii} = A_{ii} - \sum_{k<i} L_{ik} D_k U_{ki}$ is the reduced diagonal block),
but once this is done the factorization is uniquely defined. Choosing $L_{ii} = D_i = 1$ so
that $U_{ii} = \bar{A}_{ii}$ gives the (block) LU decomposition A = L U, the matrix representation
of (block) Gaussian elimination. Choosing $L_{ii} = U_{ii} = 1$ so that $D_i = \bar{A}_{ii}$ gives the
LDU decomposition. If A is symmetric, the LDU decomposition preserves the symmetry
and becomes the LDL^T decomposition $A = L\,D\,L^\top$ where $U = L^\top$ and $D = D^\top$. If A
is symmetric positive definite we can set D = 1 to get the Cholesky decomposition
$A = L\,L^\top$, where $L_{ii} L_{ii}^\top = \bar{A}_{ii}$ (recursively) defines the Cholesky factor $L_{ii}$ of the positive
definite matrix $\bar{A}_{ii}$. (For a scalar, $\mathrm{Chol}(a) = \sqrt{a}$). If all of the blocks are chosen to be
1 × 1, we get the conventional scalar forms of these decompositions. These decompositions
are obviously equivalent, but for speed and simplicity it is usual to use the most specific
one that applies: LU for general matrices, LDL^T for symmetric ones, and Cholesky for
symmetric positive definite ones. For symmetric matrices such as the bundle Hessian,
LDL^T / Cholesky are 1.5-2 times faster than LDU / LU. We will use the general form (50)
below as it is trivial to specialize to any of the others.
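
As a concrete instance of the recursion, the following sketch factorizes a matrix with 1 × 1 blocks and unit-diagonal L and U (the LDU choice above), and then specializes the result to a Cholesky factor for a symmetric positive definite input. It assumes no pivoting is needed; `ldu` and `cholesky_from_ldu` are illustrative names, not part of any library.

```python
import math


def ldu(A):
    """Scalar LDU factorization A = L D U with unit-diagonal L and U, without pivoting."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for m in range(n):
        # Reduced pivot: A_mm minus the contributions of earlier pivots
        D[m] = A[m][m] - sum(L[m][k] * D[k] * U[k][m] for k in range(m))
        for i in range(m + 1, n):
            L[i][m] = (A[i][m] - sum(L[i][k] * D[k] * U[k][m] for k in range(m))) / D[m]
            U[m][i] = (A[m][i] - sum(L[m][k] * D[k] * U[k][i] for k in range(m))) / D[m]
    return L, D, U


def cholesky_from_ldu(L, D):
    """For symmetric positive definite A (U = L^T, D > 0), return the Cholesky factor L sqrt(D)."""
    n = len(D)
    return [[L[i][j] * math.sqrt(D[j]) for j in range(n)] for i in range(n)]


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, D, U = ldu(A)
C = cholesky_from_ldu(L, D)   # lower triangular, with C C^T = A
```

For symmetric input the computed U is the transpose of L, so a dedicated LDL^T routine would skip the U update entirely, which is where the speed advantage mentioned above comes from.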

Loop ordering: From (51), the ij block of the decomposition depends only on the
upper left (m − 1) × (m − 1) submatrix and the first m elements of row i and column j of
A, where m = min(i, j). This allows considerable freedom in the ordering of operations
during decomposition, which can be exploited to enhance parallelism and improve memory
cache locality.

Fill in: If A is sparse, its L and U factors tend to become ever denser as the decomposition
progresses. Recursively expanding $\bar{A}_{ik}$ and $\bar{A}_{kj}$ in (51) gives contributions of the form
$\pm A_{ik} A_{kk}^{-1} A_{kl} \cdots A_{pq} A_{qq}^{-1} A_{qj}$ for $k, l \ldots p, q < \min(i,j)$. So even if $A_{ij}$ is zero, if there
is any path of the form $i \to k \to l \to \ldots \to p \to q \to j$ via non-zero $A_{kl}$ with
$k, l \ldots p, q < \min(i,j)$, the ij block of the decomposition will generically fill-in (become
non-zero). The amount of fill-in is strongly dependent on the ordering of the variables
(i.e. of the rows and columns of A). Sparse factorization methods (§6.3) manipulate this
ordering to minimize either fill-in or total operation counts.
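
The dependence of fill-in on variable ordering can be checked symbolically, without any arithmetic. The sketch below counts the fill edges created when the variables of a small "arrow" pattern (one hub variable coupled to every other variable) are eliminated in a given order; the function name `count_fill` and the example pattern are illustrative assumptions.

```python
def count_fill(adjacency, order):
    """Count fill-in entries created when eliminating variables in the given order.

    adjacency maps each variable to the set of variables it is coupled to
    (the off-diagonal non-zero pattern of a symmetric matrix).
    """
    adj = {v: set(nb) for v, nb in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = sorted(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        # Eliminating v couples all of its remaining neighbours pairwise
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                if nbrs[b] not in adj[nbrs[a]]:
                    adj[nbrs[a]].add(nbrs[b])
                    adj[nbrs[b]].add(nbrs[a])
                    fill += 1
    return fill


# Arrow pattern: variable 0 is coupled to variables 1..5, which are otherwise independent.
star = {0: {1, 2, 3, 4, 5}}
for leaf in range(1, 6):
    star[leaf] = {0}

hub_first = count_fill(star, [0, 1, 2, 3, 4, 5])   # hub eliminated first
hub_last = count_fill(star, [1, 2, 3, 4, 5, 0])    # hub eliminated last
```

Eliminating the hub first makes every pair of leaves fill in, while eliminating it last creates no fill at all, although the two matrices differ only by a permutation.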

Pivoting: For positive definite matrices, the above factorizations are very stable because
the pivots $\bar{A}_{ii}$ must themselves remain positive definite. More generally, the pivots may
become ill-conditioned, causing the decomposition to break down. To deal with this, it is
usual to search the undecomposed part of the matrix for a large pivot at each step, and
permute this into the leading position before proceeding. The stablest policy is full pivoting
which searches the whole submatrix, but usually a less costly partial pivoting search
over just the current column (column pivoting) or row (row pivoting) suffices. Pivoting
ensures that L and/or U are relatively well-conditioned and postpones ill-conditioning in
D for as long as possible, but it cannot ultimately make D any better conditioned than A is.
Column pivoting is usual for the LU decomposition, but if applied to a symmetric matrix
it destroys the symmetry and hence doubles the workload. Diagonal pivoting preserves
symmetry by searching for the largest remaining diagonal element and permuting both
its row and its column to the front. This suffices for positive semidefinite matrices (e.g.
gauge deficient Hessians). For general symmetric indefinite matrices (e.g. the augmented
Hessians $\begin{pmatrix} H & C \\ C^\top & 0 \end{pmatrix}$ of constrained problems (12)), off-diagonal pivots cannot be avoided,
but there are fast, stable, symmetry-preserving pivoted LDL^T decompositions with block
diagonal D having 1 × 1 and 2 × 2 blocks. Full pivoting is possible (Bunch-Parlett
decomposition), but Bunch-Kaufman decomposition, which searches the diagonal and
only one or at most two columns, usually suffices. This method is nearly as fast as pivoted
Cholesky decomposition (to which it reduces for positive matrices), and as stable as LU
decomposition with partial pivoting. Aasen’s method has similar speed and stability but
produces a tridiagonal D. The augmented Hessian $\begin{pmatrix} H & C \\ C^\top & 0 \end{pmatrix}$ has further special properties
owing to its zero block, but we will not consider these here; see [44, §4.4.6 Equilibrium
Systems].
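
The need for off-diagonal pivots on indefinite matrices can be illustrated numerically on the 2 × 2 matrix with rows (ε, 1) and (1, 0). The sketch below is a hand-written illustration, not a Bunch-Kaufman implementation: one function solves with the unpivoted LDL^T factors, the other keeps the whole matrix as a single undecomposed 2 × 2 pivot block. Both function names are hypothetical.

```python
def solve_unpivoted_ldlt(eps, b):
    """Solve [[eps, 1], [1, 0]] x = b using the unpivoted factorization L D L^T."""
    l = 1.0 / eps                      # L = [[1, 0], [l, 1]]
    d1 = eps                           # D = diag(d1, d2)
    d2 = 0.0 - l * d1 * l              # equals -1/eps: a huge second pivot
    y1 = b[0]                          # forward substitution L y = b
    y2 = b[1] - l * y1
    z1 = y1 / d1                       # diagonal solve D z = y
    z2 = y2 / d2
    x2 = z2                            # back substitution L^T x = z
    x1 = z1 - l * x2                   # catastrophic cancellation for tiny eps
    return [x1, x2]


def solve_block_pivot(eps, b):
    """Solve the same system treating the matrix as one undecomposed 2x2 pivot block."""
    det = eps * 0.0 - 1.0 * 1.0
    x1 = (0.0 * b[0] - 1.0 * b[1]) / det
    x2 = (-1.0 * b[0] + eps * b[1]) / det
    return [x1, x2]


eps = 1e-20
b = [1.0, 1.0]                         # exact solution is x = (1, 1 - eps)
x_bad = solve_unpivoted_ldlt(eps, b)
x_good = solve_block_pivot(eps, b)
```

The matrix itself is well-conditioned, so the error in `x_bad` comes entirely from the choice of pivots, which is the situation that the 2 × 2 block strategy is designed to avoid.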

Footnote (off-diagonal pivots, B.1): The archetypical failure is the unstable LDL^T decomposition
of the well-conditioned symmetric indefinite matrix
$\begin{pmatrix} \epsilon & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1/\epsilon & 1 \end{pmatrix} \begin{pmatrix} \epsilon & 0 \\ 0 & -1/\epsilon \end{pmatrix} \begin{pmatrix} 1 & 1/\epsilon \\ 0 & 1 \end{pmatrix}$, for $\epsilon \to 0$.
Fortunately, for small diagonal elements, permuting the dominant off-diagonal element next to
the diagonal and leaving the resulting 2 × 2 block undecomposed in D suffices for stability.

B.2 Orthogonal Decompositions

For least squares problems, there is an alternative family of decompositions based on
orthogonal reduction of the Jacobian $J = \frac{dz}{dx}$. Given any rectangular matrix A, it can be
decomposed as A = Q R where R is upper triangular and Q is orthogonal (i.e., its columns
are orthonormal unit vectors). This is called the QR decomposition of A. R is identical
to the right Cholesky factor of $A^\top A = (R^\top Q^\top)(Q\,R) = R^\top R$. The solution of the linear
least squares problem $\min_x \|A\,x - b\|^2$ is $x = R^{-1} Q^\top b$, and $R^{-1} Q^\top$ is the Moore-Penrose
pseudo-inverse of A. The QR decomposition is calculated by finding a series of simple
rotations that successively zero below-diagonal elements of A to form R, and accumulating
the rotations in Q, $Q^\top A = R$. Various types of rotations can be used. Givens rotations
are the fine-grained extreme: one-parameter 2 × 2 rotations that zero a single element of
A and affect only two of its rows. Householder reflections are coarser-grained reflections
in hyperplanes, $1 - 2\,\frac{v\,v^\top}{\|v\|^2}$, designed to zero an entire below-diagonal column of A and
affecting all elements of A in or below the diagonal row of that column. Intermediate
sizes of Householder reflections can also be used, the 2 × 2 case being computationally
equivalent, and equal up to a sign, to the corresponding Givens rotation. This is useful for
sparse QR decompositions, e.g. multifrontal methods (see §6.3 and [11]). The Householder
method is the most common one for general use, owing to its speed and simplicity. Both
the Givens and Householder methods calculate R explicitly, but Q is not calculated directly
unless it is explicitly needed. Instead, it is stored in factorized form (as a series of 2 × 2
rotations or Householder vectors), and applied piecewise when needed. In particular, $Q^\top b$
is needed to solve the least squares system, but it can be calculated progressively as part of
the decomposition process. As for Cholesky decomposition, QR decomposition is stable
without pivoting so long as A has full column rank and is not too ill-conditioned. For
degenerate A, Householder QR decomposition with column exchange pivoting can be
used. See [11] for more information about QR decomposition.

Both QR decomposition of A and Cholesky decomposition of the normal matrix $A^\top A$
can be used to calculate the Cholesky / QR factor R and to solve least squares problems
with design matrix / Jacobian A. The QR method runs about as fast as the normal / Cholesky
one for square A, but becomes twice as slow for long thin A (i.e. many observations in
relatively few parameters). However, the QR is numerically much stabler than the normal /
Cholesky one in the following sense: if A has condition number (ratio of largest to smallest
singular value) c and the machine precision is ε, the QR solution has relative error O(c ε),
whereas the normal matrix $A^\top A$ has condition number c² and its solution has relative error
O(c² ε). This matters only if c² ε approaches the relative accuracy to which the solution
is required. For example, even in accurate bundle adjustments, we do not need relative
accuracies greater than about 1 : 10^6. As ε ∼ 10^-16 for double precision floating point,
we can safely use the normal equation method for c(J) < 10^5, whereas the QR method is
safe up to c(J) < 10^10, where J is the bundle Jacobian. In practice, the Gauss-Newton /
normal equation approach is used in most bundle implementations.
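
The squaring of the condition number can be made concrete on a deliberately ill-conditioned toy problem (a 3 × 2 matrix of the kind often attributed to Läuchli, chosen here purely for illustration). The sketch solves the least squares problem by Givens rotations, and separately inspects the pivots of the normal matrix; the function names are hypothetical.

```python
import math


def lstsq_givens(A, b):
    """Solve min ||A x - b|| by Givens rotations applied to the augmented matrix [A | b]."""
    m, n = len(A), len(A[0])
    M = [list(A[i]) + [b[i]] for i in range(m)]
    for j in range(n):
        for i in range(j + 1, m):
            if M[i][j] == 0.0:
                continue
            r = math.hypot(M[j][j], M[i][j])
            c, s = M[j][j] / r, M[i][j] / r
            for k in range(j, n + 1):      # rotate rows j and i to zero M[i][j]
                top, low = M[j][k], M[i][k]
                M[j][k] = c * top + s * low
                M[i][k] = -s * top + c * low
    x = [0.0] * n
    for i in reversed(range(n)):           # back substitution on R x = Q^T b
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def normal_matrix(A):
    """Form A^T A explicitly."""
    m, n = len(A), len(A[0])
    return [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)] for i in range(n)]


def ldl_pivots(H):
    """Pivots d of H = L D L^T; Cholesky needs all of them to be strictly positive."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = H[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        if d[j] <= 0.0:
            break                          # factorization has broken down
        for i in range(j + 1, n):
            L[i][j] = (H[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    return d


delta = 1e-8
A = [[1.0, 1.0], [delta, 0.0], [0.0, delta]]
b = [2.0, delta, delta]                    # consistent with x = (1, 1)
x_qr = lstsq_givens(A, b)
pivots = ldl_pivots(normal_matrix(A))      # second pivot is lost to rounding
```

Here c is of order 10^8, so c ε is still small but c² ε is of order one: the QR route recovers the solution while the normal matrix is numerically singular.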

Individual Householder reflections are also useful for projecting parametrizations of
geometric entities orthogonal to some constraint vector. For example, for quaternions
or homogeneous projective vectors X, we often want to enforce spherical normalization
$\|X\|^2 = 1$. To first order, only displacements $\delta X$ orthogonal to X are allowed, $X^\top \delta X = 0$.
To parametrize the directions we can move in, we need a basis for the vectors orthogonal
to X. A Householder reflection Q based on X converts X to $(1\; 0 \ldots 0)^\top$ and hence the
orthogonal directions to vectors of the form $(0\; * \ldots *)^\top$. So if U contains rows 2-n of Q,
we can reduce Jacobians $\frac{d}{dX}$ to the n − 1 independent parameters $\delta u$ of the orthogonal
subspace by post-multiplying by $U^\top$, and once we have solved for $\delta u$, we can recover
the orthogonal $\delta X \approx U^\top \delta u$ by premultiplying by $U^\top$. Multiple constraints can be enforced
by successive Householder reductions of this form. This corresponds exactly to the LQ
method for solving constrained least squares problems [11].
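
The Householder construction for unit-norm parametrizations can be sketched as follows. This is an illustrative implementation under the usual sign convention (the reflection maps X to plus or minus its norm times the first basis vector, whichever avoids cancellation); the function names `householder_tangent_basis`, `reduce_jacobian` and `lift` are hypothetical, and X is assumed to be non-zero.

```python
import math


def householder_tangent_basis(X):
    """Rows 2..n of a Householder reflection Q for which Q X is parallel to e_1.

    Returns U as a list of n-1 orthonormal rows, each orthogonal to X.
    """
    n = len(X)
    norm = math.sqrt(sum(v * v for v in X))
    alpha = -math.copysign(norm, X[0])       # Q X = alpha e_1, sign chosen for stability
    v = list(X)
    v[0] -= alpha
    vv = sum(t * t for t in v)
    Q = [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv for j in range(n)]
         for i in range(n)]
    return Q[1:]


def reduce_jacobian(J, U):
    """Post-multiply an m x n Jacobian by U^T, giving an m x (n-1) reduced Jacobian."""
    return [[sum(row[k] * u[k] for k in range(len(u))) for u in U] for row in J]


def lift(U, du):
    """Recover the full displacement dX = U^T du from the reduced parameters du."""
    n = len(U[0])
    return [sum(U[a][k] * du[a] for a in range(len(U))) for k in range(n)]


q = [0.5, 0.5, 0.5, 0.5]                     # a unit quaternion
U = householder_tangent_basis(q)
dq = lift(U, [0.01, -0.02, 0.03])            # first-order update orthogonal to q
```

After adding `dq` to `q`, the result would normally be renormalized, since the orthogonality constraint only holds to first order.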
L = profile_cholesky_decomp(A)
    for i = 1 to n do
        for j = first(i) to i do
            a = A_ij − sum_{k = max(first(i), first(j))}^{j−1} L_ik L_jk
            L_ij = (j < i) ? a / L_jj : sqrt(a)

x = profile_cholesky_forward_subs(A, b)
    for i = first(b) to n do
        x_i = ( b_i − sum_{k = max(first(i), first(b))}^{i−1} L_ik x_k ) / L_ii

y = profile_cholesky_back_subs(A, x)
    y = x
    for i = last(x) to 1 step −1 do
        y_i = y_i / L_ii
        for k = first(i) to i − 1 do
            y_k = y_k − L_ik y_i

Fig. 10. A complete implementation of profile Cholesky decomposition.



B.3 Profile Cholesky Decomposition 

One of the simplest sparse methods suitable for bundle problems is profile Cholesky
decomposition. With natural (features then cameras) variable ordering, it is as efficient
as any method for dense networks (i.e. most features visible in most images, giving dense
camera-feature coupling blocks in the Hessian). With suitable variable ordering, it is also
efficient for some types of sparse problems, particularly ones with chain-like connectivity.

Figure 10 shows the complete implementation of profile Cholesky, including decomposition
L = profile_cholesky_decomp(A), forward substitution x = profile_cholesky_forward_subs(A, b),
and back substitution y = profile_cholesky_back_subs(A, x).
first(b), last(b) are the indices of the first and last nonzero entries of b, and first(i) is the
index of the first nonzero entry in row i of A and hence L. If desired, L, x, y can overwrite
A, b, x during decomposition to save storage. As always with factorizations, the loops can
be reordered in several ways. These have the same operation counts but different access
patterns and hence memory cache localities, which on modern machines can lead to
significant performance differences for large problems. Here we store and access A and L
consistently by rows.
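
A runnable counterpart of the three profile Cholesky routines might look as follows. It is a sketch under simplifying assumptions: indices are 0-based, the matrix is held as dense row lists for readability (a real profile code would store only entries first[i]..i of each row), the profile array `first` is supplied by the caller, and the right-hand side is treated as dense.

```python
import math


def profile_cholesky_decomp(A, first):
    """Cholesky factor L of symmetric positive definite A, touching only the row profile.

    first[i] is the column index of the first non-zero entry in row i of A.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(first[i], i + 1):
            a = A[i][j] - sum(L[i][k] * L[j][k]
                              for k in range(max(first[i], first[j]), j))
            L[i][j] = a / L[j][j] if j < i else math.sqrt(a)
    return L


def profile_cholesky_forward_subs(L, first, b):
    """Solve L x = b, accessing L by rows."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        x[i] = (b[i] - sum(L[i][k] * x[k] for k in range(first[i], i))) / L[i][i]
    return x


def profile_cholesky_back_subs(L, first, x):
    """Solve L^T y = x, still accessing L by rows."""
    y = list(x)
    for i in reversed(range(len(y))):
        y[i] /= L[i][i]
        for k in range(first[i], i):
            y[k] -= L[i][k] * y[i]      # remove the finished y_i from earlier equations
    return y


# Usage: a chain-like (tridiagonal) system, whose profile is one entry wide.
n = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)] for i in range(n)]
first = [max(0, i - 1) for i in range(n)]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
L = profile_cholesky_decomp(A, first)
solution = profile_cholesky_back_subs(L, first, profile_cholesky_forward_subs(L, first, b))
```

The factor never acquires entries outside the profile, which is why a chain-like ordering keeps both storage and work linear in the number of variables.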



Footnote (variable ordering for profile Cholesky, B.3): Snay’s Banker’s strategy (§6.3, [101, 24])
seems to be one of the most effective ordering strategies.

B.4 Matrix Inversion and Covariances

When solving linear equations, forward-backward substitutions (22, 24) are much faster
than explicitly calculating and multiplying by $A^{-1}$, and numerically stabler too. Explicit
inverses are only rarely needed, e.g. to evaluate the dispersion (“covariance”) matrix $H^{-1}$.
Covariance calculation is expensive for bundle adjustment: no matter how sparse H may be,
$H^{-1}$ is always dense. Given a triangular decomposition A = L D U, the most obvious way
to calculate $A^{-1}$ is via the product $A^{-1} = U^{-1} D^{-1} L^{-1}$, where $L^{-1}$ (which is lower triangular)
is found using a recurrence based on either $L^{-1} L = 1$ or $L\,L^{-1} = 1$ as follows (and similarly
but transposed for U):

$$(L^{-1})_{ii} = (L_{ii})^{-1}, \qquad
(L^{-1})_{ji} = -L_{jj}^{-1} \Big( \sum_{k=i}^{j-1} L_{jk}\, (L^{-1})_{ki} \Big)
= -\Big( \sum_{k=i+1}^{j} (L^{-1})_{jk}\, L_{ki} \Big) L_{ii}^{-1} \quad (j > i) \qquad (52)$$

Alternatively [45, 11], the diagonal and the (zero) upper triangle of the linear system
$U A^{-1} = D^{-1} L^{-1}$ can be combined with the (zero) lower triangle of $A^{-1} L = U^{-1} D^{-1}$ to give
the direct recursion (i = n ... 1 and j = n ... i + 1):

$$(A^{-1})_{ji} = -\Big( \sum_{k=i+1}^{n} (A^{-1})_{jk}\, L_{ki} \Big) L_{ii}^{-1}, \qquad
(A^{-1})_{ij} = -U_{ii}^{-1} \Big( \sum_{k=i+1}^{n} U_{ik}\, (A^{-1})_{kj} \Big),$$
$$(A^{-1})_{ii} = U_{ii}^{-1} \Big( D_i^{-1} L_{ii}^{-1} - \sum_{k=i+1}^{n} U_{ik}\, (A^{-1})_{ki} \Big)
= \Big( U_{ii}^{-1} D_i^{-1} - \sum_{k=i+1}^{n} (A^{-1})_{ik}\, L_{ki} \Big) L_{ii}^{-1} \qquad (53)$$

In the symmetric case $(A^{-1})_{ji} = (A^{-1})_{ij}^\top$ so we can avoid roughly half of the work. If only a
few blocks of $A^{-1}$ are required (e.g. the diagonal ones), this recursion has the property that
the blocks of $A^{-1}$ associated with the filled positions of L and U can be calculated without
calculating any blocks associated with unfilled positions. More precisely, to calculate
$(A^{-1})_{ij}$ for which $L_{ji}$ (j > i) or $U_{ji}$ (j < i) is non-zero, we do not need any block
$(A^{-1})_{kl}$ for which $L_{lk} = 0$ (l > k) or $U_{lk} = 0$ (l < k). This is a significant saving if
L, U are sparse, as in bundle problems. In particular, given the covariance of the reduced
camera system, the 3D feature variances and feature-camera covariances can be calculated
efficiently using (53) (or equivalently (17), where A is the block diagonal feature
Hessian and $D_2$ is the reduced camera one).
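
The direct recursion for the inverse can be written out for the scalar case with unit-diagonal L and U, where it reduces to three short sums. This is a minimal sketch, assuming the factors are already available; the function name `inverse_from_ldu` and the hard-coded example factors are illustrative.

```python
def inverse_from_ldu(L, D, U):
    """Inverse of A = L D U (unit-diagonal L and U, scalar D) by the direct recursion.

    Rows and columns are filled from the last index backwards, so every entry
    needed on the right-hand side has already been computed.
    """
    n = len(D)
    Z = [[0.0] * n for _ in range(n)]
    for i in reversed(range(n)):
        for j in reversed(range(i + 1, n)):
            # Below-diagonal entry (j, i), from the zero lower triangle of A^-1 L
            Z[j][i] = -sum(Z[j][k] * L[k][i] for k in range(i + 1, n))
            # Above-diagonal entry (i, j), from the zero upper triangle of U A^-1
            Z[i][j] = -sum(U[i][k] * Z[k][j] for k in range(i + 1, n))
        # Diagonal entry, from the diagonal of U A^-1 = D^-1 L^-1
        Z[i][i] = 1.0 / D[i] - sum(U[i][k] * Z[k][i] for k in range(i + 1, n))
    return Z


# Factors of A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.5, 0.5, 1.0]]
D = [4.0, 4.0, 4.0]
U = [[1.0, 0.5, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]
A_inv = inverse_from_ldu(L, D, U)
```

In a sparse setting the sums would run only over the non-zero entries of L and U, which is what allows selected covariance blocks to be computed without forming the whole inverse.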



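Footnote (sparsity property of the recursion (53), B.4): This holds because of the way fill-in
occurs in the LDU decomposition. Suppose that we want to find $(A^{-1})_{ij}$, where $j > i$ and
$L_{ji} \neq 0$. For this we need $(A^{-1})_{kj}$ for all non-zero $U_{ik}$, $k > i$. But for these
$\bar{A}_{jk} = L_{ji}\, D_i\, U_{ik} + \ldots + A_{jk} \neq 0$, so $(A^{-1})_{kj}$ is associated with a filled position and
will already have been evaluated.

B.5 Factorization Updating

For on-line applications (§8.2), it is useful to be able to update the decomposition
A = L D U to account for a (usually low-rank) change $A \to \bar{A} \equiv A \pm B\,W\,C$. Let $\bar{B} \equiv L^{-1} B$
and $\bar{C} \equiv C\,U^{-1}$ so that $L^{-1} \bar{A}\, U^{-1} = D \pm \bar{B}\,W\,\bar{C}$. This low-rank update of D can be LDU
decomposed efficiently. Separating the first block of D from the others we have:

$$\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} \pm \begin{pmatrix} \bar{B}_1 \\ \bar{B}_2 \end{pmatrix} W \begin{pmatrix} \bar{C}_1 & \bar{C}_2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ \pm \bar{B}_2 W \bar{C}_1 \bar{D}_1^{-1} & 1 \end{pmatrix}
\begin{pmatrix} \bar{D}_1 & 0 \\ 0 & \bar{D}_2 \end{pmatrix}
\begin{pmatrix} 1 & \pm \bar{D}_1^{-1} \bar{B}_1 W \bar{C}_2 \\ 0 & 1 \end{pmatrix}$$
$$\bar{D}_1 \equiv D_1 \pm \bar{B}_1 W \bar{C}_1, \qquad
\bar{D}_2 \equiv D_2 \pm \bar{B}_2 \big( W \mp W \bar{C}_1 \bar{D}_1^{-1} \bar{B}_1 W \big) \bar{C}_2 \qquad (54)$$

$\bar{D}_2$ is a low-rank update of $D_2$ with the same $\bar{C}_2$ and $\bar{B}_2$ but a different W. Evaluating
this recursively and merging the resulting L and U factors into $\bar{L}$ and $\bar{U}$ gives the updated
decomposition $\bar{A} = \bar{L}\,\bar{D}\,\bar{U}$:

    $W^{(1)} \leftarrow \pm W$ ;  $B^{(1)} \leftarrow B$ ;  $C^{(1)} \leftarrow C$ ;
    for i = 1 to n do
        $\bar{B}_i \leftarrow B_i^{(i)}$ ;  $\bar{C}_i \leftarrow C_i^{(i)}$ ;  $\bar{D}_i \leftarrow D_i + \bar{B}_i W^{(i)} \bar{C}_i$ ;
        $W^{(i+1)} \leftarrow W^{(i)} - W^{(i)} \bar{C}_i \bar{D}_i^{-1} \bar{B}_i W^{(i)} = \big( (W^{(i)})^{-1} + \bar{C}_i D_i^{-1} \bar{B}_i \big)^{-1}$ ;
        for j = i + 1 to n do
            $B_j^{(i+1)} \leftarrow B_j^{(i)} - L_{ji} \bar{B}_i$ ;  $\bar{L}_{ji} \leftarrow L_{ji} + B_j^{(i+1)} W^{(i+1)} \bar{C}_i D_i^{-1}$ ;
            $C_j^{(i+1)} \leftarrow C_j^{(i)} - \bar{C}_i U_{ij}$ ;  $\bar{U}_{ij} \leftarrow U_{ij} + D_i^{-1} \bar{B}_i W^{(i+1)} C_j^{(i+1)}$ ;
                                                                                        (55)

The inverse-based form of the W update is numerically stabler for additions (‘+’ sign in A ± B W C
with positive W), but is not usable unless $W^{(i)}$ is invertible. In either case, the update
takes time $O((k^2 + b^2) N^2)$ where A is N × N, W is k × k and the $D_i$ are b × b. So other
things being equal, k should be kept as small as possible (e.g. by splitting the update into
independent rows using an initial factorization of W, and updating for each row in turn).
The scalar Cholesky form of this method for a rank one update $A \to A + w\, b\, b^\top$ is:

    $w^{(1)} \leftarrow w$ ;  $b^{(1)} \leftarrow b$ ;
    for i = 1 to n do
        $\bar{b}_i \leftarrow b_i^{(i)} / L_{ii}$ ;  $d_i \leftarrow 1 + w^{(i)} \bar{b}_i^2$ ;  $w^{(i+1)} \leftarrow w^{(i)} / d_i$ ;  $L_{ii} \leftarrow L_{ii} \sqrt{d_i}$ ;
        for j = i + 1 to n do
            $b_j^{(i+1)} \leftarrow b_j^{(i)} - L_{ji} \bar{b}_i$ ;  $L_{ji} \leftarrow \big( L_{ji} + b_j^{(i+1)} w^{(i+1)} \bar{b}_i \big) \sqrt{d_i}$ ;
                                                                                        (56)

This takes $O(n^2)$ operations. The same recursion rule (and several equivalent forms) can
be derived by reducing $(L\; b)^\top$ to an upper triangular matrix using Givens rotations or
Householder transformations [43, 11].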
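
The scalar rank one recursion translates almost line for line into code. The sketch below is an illustration rather than a production routine: it copies its inputs, uses 0-based indices, and assumes that every $d_i$ stays positive (which holds for additions with w > 0 but can fail for downdates); the function name is hypothetical.

```python
import math


def cholesky_rank_one_update(L, w, b):
    """Return the Cholesky factor of L L^T + w b b^T, given lower triangular L.

    The inputs are left unmodified. Requires 1 + w * bbar^2 > 0 at every step.
    """
    n = len(b)
    L = [row[:] for row in L]
    b = list(b)
    for i in range(n):
        bbar = b[i] / L[i][i]              # i-th component of L^-1 b
        d = 1.0 + w * bbar * bbar
        w_next = w / d
        root = math.sqrt(d)
        L[i][i] *= root
        for j in range(i + 1, n):
            b[j] -= L[j][i] * bbar         # uses the old L[j][i]
            L[j][i] = (L[j][i] + b[j] * w_next * bbar) * root
        w = w_next
    return L


# Usage: update the factor of A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]] by b b^T.
L0 = [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
L1 = cholesky_rank_one_update(L0, 1.0, [1.0, 2.0, 3.0])
```

Each outer step costs O(n) operations, so the whole update is quadratic in n, compared with the cubic cost of refactorizing from scratch.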



Footnote (update recursion (55), B.5): Here, $B_j^{(i)} = B_j - \sum_{k=1}^{i-1} L_{jk} \bar{B}_k = \sum_{k=i}^{j} L_{jk} \bar{B}_k$ and
$C_j^{(i)} = C_j - \sum_{k=1}^{i-1} \bar{C}_k U_{kj} = \sum_{k=i}^{j} \bar{C}_k U_{kj}$ accumulate $L^{-1} B$ and $C\,U^{-1}$. For the L, U
updates one can also use $W^{(i+1)} \bar{C}_i D_i^{-1} = W^{(i)} \bar{C}_i \bar{D}_i^{-1}$ and $D_i^{-1} \bar{B}_i W^{(i+1)} = \bar{D}_i^{-1} \bar{B}_i W^{(i)}$.

C Software

C.1 Software Organization

For a general purpose bundle adjustment code, an extensible object-based organization is
natural. The measurement network can be modelled as a network of objects, representing
measurements and their error models and the different types of 3D features and camera
models that they depend on. It is obviously useful to allow the measurement, feature and
camera types to be open-ended. Measurements may be 2D or 3D, implicit or explicit, and
many different robust error models are possible. Features may range from points through
curves and homographies to entire 3D object models. Many types of camera and lens
distortion models exist. If the scene is dynamic or articulated, additional nodes representing
3D transformations (kinematic chains or relative motions) may also be needed.

The main purpose of the network structure is to predict observations and their Jacobians 
w.r.t. the free parameters, and then to integrate the resulting first order parameter updates 
back into the internal 3D feature and camera state representations. Prediction is essentially 
a matter of systematically propagating values through the network, with heavy use of the 
chain rule for derivative propagation. The network representation must interface with a 
numerical linear algebra one that supports appropriate methods for forming and solving the 
sparse, damped Gauss-Newton (or other) step prediction equations. A fixed-order sparse 
factorization may suffice for simple networks, while automatic variable ordering is needed 
for more complicated networks and iterative solution methods for large ones. 
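
A toy version of this organization is sketched below. It is only meant to show how prediction and chain-rule Jacobians can be attached to the measurement objects of the network; the class names, the translation-only camera model and the unit focal length are assumptions made for brevity, and none of this reflects the design of any of the codes mentioned in this section.

```python
def project(p):
    """Pinhole projection u = (x/z, y/z) of a camera-frame point, with its 2x3 Jacobian du/dp."""
    x, y, z = p
    u = [x / z, y / z]
    J = [[1.0 / z, 0.0, -x / (z * z)],
         [0.0, 1.0 / z, -y / (z * z)]]
    return u, J


class PointFeature:
    """A 3D point; other feature types would expose the same update interface."""
    def __init__(self, X):
        self.X = list(X)

    def update(self, dX):
        self.X = [a + d for a, d in zip(self.X, dX)]


class TranslatingCamera:
    """Hypothetical camera with fixed orientation and a free centre t."""
    def __init__(self, t):
        self.t = list(t)

    def to_camera_frame(self, X):
        # p = X - t, so dp/dX = +I and dp/dt = -I
        return [a - b for a, b in zip(X, self.t)]

    def update(self, dt):
        self.t = [a + d for a, d in zip(self.t, dt)]


class ImagePointMeasurement:
    """Network node linking one camera and one feature to an observed image point."""
    def __init__(self, camera, feature, observed):
        self.camera, self.feature, self.observed = camera, feature, list(observed)

    def predict(self):
        """Predicted image point and its Jacobians w.r.t. the feature and the camera."""
        p = self.camera.to_camera_frame(self.feature.X)
        u, J_p = project(p)
        J_feature = J_p                                   # du/dX = du/dp * (+I)
        J_camera = [[-v for v in row] for row in J_p]     # du/dt = du/dp * (-I)
        return u, J_feature, J_camera

    def residual(self):
        u, _, _ = self.predict()
        return [a - b for a, b in zip(u, self.observed)]


camera = TranslatingCamera([0.1, -0.2, 0.0])
point = PointFeature([0.5, 0.3, 4.0])
measurement = ImagePointMeasurement(camera, point, observed=[0.1, 0.12])
u, J_feature, J_camera = measurement.predict()
```

A linear algebra layer would then collect the residuals and Jacobian blocks from all measurement nodes, solve for the step, and hand the increments back through the `update` methods.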

Several extensible bundle codes exist, but as far as we are aware, none of them are 
currently available as freeware. Our own implementations include: 

• Carmen [59] is a program for camera modelling and scene reconstruction using itera- 
tive nonlinear least squares. It has a modular design that allows many different feature, 
measurement and camera types to be incorporated (including some quite exotic ones 
[56, 63]). It uses sparse matrix techniques similar to Brown’s reduced camera system 
method [19] to make the bundle adjustment iteration efficient. 

• Horatio (http://www.ee.surrey.ac.uk/Personal/RMcLauchlan/horatio/html, [85], [86], 
[83], [84]) is a C library supporting the development of efficient computer vision ap- 
plications. It contains support for image processing, linear algebra and visualization, 
and will soon be made publicly available. The bundle adjustment methods in Horatio, 
which are based on the Variable State Dimension Filter (VSDF) [83, 84], are being 
commercialized. These algorithms support sparse block matrix operations, arbitrary 
gauge constraints, global and local parametrizations, multiple feature types and camera 
models, as well as batch and sequential operation. 

• VXL : This modular C++ vision environment is a new, lightweight version of the
TargetJr/IUE environment, which is being developed mainly by the Universities of Oxford
and Leuven, and General Electric CRD. The initial public release on
http://www.robots.ox.ac.uk/~vxl will include an OpenGL user interface and classes
for multiple view geometry and numerics (the latter being mainly C++ wrappers to well
established routines from Netlib, see below). A bundle adjustment code exists for it
but is not currently planned for release [28, 62].



C.2 Software Resources 

A great deal of useful numerical linear algebra and optimization software is available on
the Internet, although more commonly in Fortran than in C/C++. The main repository
is Netlib at http://www.netlib.org/. Other useful sites include: the ‘Guide to Available
Mathematical Software’ GAMS at http://gams.nist.gov ; the NEOS guide
http://www-fp.mcs.anl.gov/otc/Guide/, which is based in part on Moré & Wright’s guide book [90] ; and
the Object Oriented Numerics page http://oonumerics.org. For large-scale dense linear
algebra, LAPACK (http://www.netlib.org/lapack, [3]) is the best package available. However
it is optimized for relatively large problems (matrices of size 100 or more), so if you are
solving many small ones (size less than 20 or so) it may be faster to use the older LINPACK and
EISPACK routines. These libraries all use the BLAS (Basic Linear Algebra Subroutines)
interface for low level matrix manipulations, optimized versions of which are available from
most processor vendors. They are all Fortran based, but C/C++ versions and interfaces
exist (CLAPACK, http://www.netlib.org/clapack; LAPACK++, http://math.nist.gov/lapack++).
For sparse matrices there is a bewildering array of packages. One good one is Boeing’s SPOOLES
(http://www.netlib.org/linalg/spooles/spooles.2.2.html) which implements sparse
Bunch-Kaufman decomposition in C with several ordering methods. For iterative linear system
solvers implementation is seldom difficult, but there are again many methods and
implementations. The ‘Templates’ book [10] contains potted code. For nonlinear optimization
there are various older codes such as MINPACK, and more recent codes designed mainly
for very large problems such as MINPACK-2 (ftp://info.mcs.anl.gov/pub/MINPACK-2)
and LANCELOT (http://www.cse.clrc.ac.uk/Activity/LANCELOT). (Both of these latter
codes have good reputations for other large scale problems, but as far as we are aware they
have not yet been tested on bundle adjustment). All of the above packages are freely
available. Commercial vendors such as NAG (http://www.nag.co.uk) and IMSL (www.imsl.com)
have their own optimization codes.



Glossary 

This glossary includes a few common terms from vision, photogrammetry, numerical optimization
and statistics, with their translations.

Additional parameters : Parameters added to the basic perspective model to represent lens distor- 
tion and similar small image deformations. 

α-distribution: A family of wide tailed probability distributions, including the Cauchy
distribution (α = 1) and the Gaussian (α = 2).

Alternation : A family of simplistic and largely outdated strategies for nonlinear optimization (and 
also iterative solution of linear equations). Cycles through variables or groups of variables, opti- 
mizing over each in turn while holding all the others fixed. Nonlinear alternation methods usually 
relinearize the equations after each group, while Gauss-Seidel methods propagate first order cor- 
rections forwards and relinearize only at the end of the cycle (the results are the same to first 
order). Successive over-relaxation adds momentum terms to speed convergence. See separable
problem. Alternation of resection and intersection is a naive and often-rediscovered bundle
method.

Asymptotic limit: In statistics, the limit as the number of independent measurements is increased 
to infinity, or as the second order moments dominate all higher order ones so that the posterior 
distribution becomes approximately Gaussian. 

Asymptotic convergence: In optimization, the limit of small deviations from the solution, i.e. as 
the solution is reached. Second order or quadratically convergent methods such as Newton’s 
method square the norm of the residual at each step, while first order or linearly convergent 
methods such as gradient descent and alternation only reduce the error by a constant factor at 
each step. 

Banker’s strategy: See fill in, §6.3. 

Block: A (possibly irregular) grid of overlapping photos in aerial cartography. 

Bunch-Kaufman : A numerically efficient factorization method for symmetric indefinite matrices,
A = L D L^T where L is lower triangular and D is block diagonal with 1 × 1 and 2 × 2 blocks
(§6.2, B.1).

Bundle adjustment: Any refinement method for visual reconstructions that aims to produce jointly
optimal structure and camera estimates.

Calibration: In photogrammetry, this always means internal calibration of the cameras. See inner
orientation.

Central limit theorem : States that maximum likelihood and similar estimators asymptotically have 
Gaussian distributions. The basis of most of our perturbation expansions. 

Cholesky decomposition : A numerically efficient factorization method for symmetric positive
definite matrices, A = L L^T where L is lower triangular.

Close Range : Any photogrammetric problem where the scene is relatively close to the camera, 
so that it has significant depth compared to the camera distance. Terrestrial photogrammetry as 
opposed to aerial cartography. 

Conjugate gradient: A cleverly accelerated first order iteration for solving positive definite linear 
systems or minimizing a nonlinear cost function. See Krylov subspace. 

Cost function : The function quantifying the total residual error that is minimized in an adjustment 
computation. 

Cramer-Rao bound : See Fisher information. 

Criterion matrix: In network design, an ideal or desired form for a covariance matrix. 

Damped Newton method : Newton’s method with a stabilizing step control policy added. See
Levenberg-Marquardt.

Data snooping: Elimination of outliers based on examination of their residual errors. 

Datum : A reference coordinate system, against which other coordinates and uncertainties are
measured. Our principal example of a gauge.

Dense : A matrix or system of equations with so few known-zero elements that it may as well be 
treated as having none. The opposite of sparse. For photogrammetric networks, dense means that 
the off-diagonal structure-camera block of the Hessian is dense, i.e. most features are seen in most 
images. 

Descent direction : In optimization, any search direction with a downhill component, i.e. that locally 
reduces the cost. 

Design : The process of defining a measurement network (placement of cameras, number of images, 
etc.) to satisfy given accuracy and quality criteria. 

Design matrix: The observation-state Jacobian $J = \frac{dz}{dx}$.

Direct method : Dense correspondence or reconstruction methods based directly on cross-correlat- 
ing photometric intensities or related descriptor images, without extracting geometric features. 
See least squares matching, feature based method. 

Dispersion matrix: The inverse of the cost function Hessian, a measure of distribution spread. In 
the asymptotic limit, the covariance is given by the dispersion. 

Downdating: On-the-fly removal of observations, without recalculating everything from scratch. 
The inverse of updating. 

Elimination graph: A graph derived from the network graph, describing the progress of fill in 
during sparse matrix factorization. 

Empirical distribution: A set of samples from some probability distribution, viewed as a
sum-of-delta-functions approximation to the distribution itself. The law of large numbers asserts that
the approximation asymptotically converges to the true distribution in probability.

Fill-in: The tendency of zero positions to become nonzero as sparse matrix factorization progresses.
Variable ordering strategies seek to minimize fill-in by permuting the variables before factor- 
ization. Methods include minimum degree, reverse Cuthill-McKee, Banker’s strategies, and 
nested dissection. See §6.3. 

Fisher information: In parameter estimation, the mean curvature of the posterior log likelihood
function, regarded as a measure of the certainty of an estimate. The Cramer-Rao bound says that
any unbiased estimator has covariance ≥ the inverse of the Fisher information.

Free gauge / free network: A gauge or datum that is defined internally to the measurement net-
work, rather than being based on predefined reference features like a fixed gauge.

Feature based: Sparse correspondence / reconstruction methods based on geometric image features
(points, lines, homographies . . . ) rather than direct photometry. See direct method. 

Filtering: In sequential problems such as time series, the estimation of a current value using all
of the previous measurements. Smoothing can correct this afterwards, by integrating also the 
information from future measurements. 

First order method / convergence : See asymptotic convergence. 

Gauge : An internal or external reference coordinate system defined for the current state and (at least) 
small variations of it, against which other quantities and their uncertainties can be measured. The 
3D coordinate gauge is also called the datum. A gauge constraint is any constraint fixing a 
specific gauge, e.g. for the current state and arbitrary (small) displacements of it. The fact that 
the gauge can be chosen arbitrarily without changing the underlying structure is called gauge 
freedom or gauge invariance. The rank-deficiency that this transformation-invariance of the cost 
function induces on the Hessian is called gauge deficiency. Displacements that violate the gauge 
constraints can be corrected by applying an S-transform, whose linear form is a gauge projection 
matrix $P_G$.

Gauss-Markov theorem : This says that for a linear system, least squares weighted by the true 
measurement covariances gives the Best (minimum variance) Linear Unbiased Estimator or BLUE. 

Gauss-Newton method: A Newton-like method for nonlinear least squares problems, in which the
Hessian is approximated by the Gauss-Newton one $H \approx J^\top W J$ where J is the design matrix
and W is a weight matrix. The normal equations are the resulting Gauss-Newton step prediction
equations $(J^\top W J)\,\delta x = -J^\top W\,\Delta z$.
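
The normal equations above can be sketched for a toy two-parameter problem. Everything specific here is an assumption made for illustration: the model z = a·exp(b·t), the function name `gauss_newton`, the diagonal weight matrix, and the synthetic data. The step is undamped, so it is only expected to behave well from a reasonable starting point.

```python
import math

def gauss_newton(ts, zs, weights, x, iterations=20):
    """Fit z = a * exp(b * t) by undamped Gauss-Newton; x = [a, b] is the start."""
    a, b = x
    for _ in range(iterations):
        # Accumulate the normal equations (J^T W J) dx = -J^T W dz.
        h00 = h01 = h11 = g0 = g1 = 0.0
        for t, z, w in zip(ts, zs, weights):
            e = math.exp(b * t)
            dz = a * e - z           # residual: predicted minus observed
            j0, j1 = e, a * t * e    # one row of the design matrix J = dz/dx
            h00 += w * j0 * j0
            h01 += w * j0 * j1
            h11 += w * j1 * j1
            g0 += w * j0 * dz
            g1 += w * j1 * dz
        # Solve the 2x2 system in closed form.
        det = h00 * h11 - h01 * h01
        da = -(h11 * g0 - h01 * g1) / det
        db = -(h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
    return a, b

ts = [0.0, 0.5, 1.0, 1.5, 2.0]
zs = [2.0 * math.exp(0.5 * t) for t in ts]  # noise-free synthetic observations
a_hat, b_hat = gauss_newton(ts, zs, [1.0] * len(ts), [1.5, 0.3])
```

Because the synthetic data are noise-free, the residuals vanish at the solution and the Gauss-Newton approximation to the Hessian becomes exact there.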

Gauss-Seidel method : See alternation. 

Givens rotation: A 2 × 2 rotation used as part of orthogonal reduction of a matrix, e.g. QR,
SVD. See Householder reflection.

Gradient: The derivative of the cost function w.r.t. the parameters, $g = \frac{df}{dx}$.

Gradient descent: Naive optimization method which consists of steepest descent (in some given 
coordinate system) down the gradient of the cost function. 

Hessian: The second derivative matrix of the cost function $H = \frac{d^2 f}{dx^2}$. Symmetric and positive
(semi-)definite at a cost minimum. Measures how ‘stiff’ the state estimate is against perturbations. Its
inverse is the dispersion matrix.

Householder reflection: A matrix representing reflection in a hyperplane, used as a tool for or- 
thogonal reduction of a matrix, e.g. QR, SVD. See Givens rotation. 

Independent model method : A suboptimal approximation to bundle adjustment developed for 
aerial cartography. Small local 3D models are reconstructed, each from a few images, and then 
glued together via tie features at their common boundaries, without a subsequent adjustment to 
relax the internal stresses so caused. 

Inner: Internal or intrinsic. 

Inner constraints : Gauge constraints linking the gauge to some weighted average of the recon- 
structed features and cameras (rather than to an externally supplied reference system). 

Inner orientation: Internal camera calibration, including lens distortion, etc. 

Inner reliability : The ability to either resist outliers, or detect and reject them based on their residual 
errors. 

Intersection: (of optical rays). Solving for 3D feature positions given the corresponding image 
features and known 3D camera poses and calibrations. See resection, alternation. 

Jacobian: See design matrix. 

Krylov subspace: The linear subspace spanned by the iterated products $\{A^k b \mid k = 0 \ldots n\}$ of
some square matrix A with some vector b, used as a tool for generating linear algebra and nonlinear
optimization iterations. Conjugate gradient is the most famous Krylov method.

Kullback-Leibler divergence: See relative entropy. 

Least squares matching: Image matching based on photometric intensities. See direct method. 

Levenberg-Marquardt: A common damping (step control) method for nonlinear least squares
problems, consisting of adding a multiple $\lambda D$ of some positive definite weight matrix D to the
Gauss-Newton Hessian before solving for the step. Levenberg-Marquardt uses a simple rescaling
based heuristic for setting $\lambda$, while trust region methods use a more sophisticated step-length
based one. Such methods are called damped Newton methods in general optimization.
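
A minimal sketch of this damping scheme follows, for a toy model z = a·exp(b·t). It takes D to be the identity and uses one common rescaling heuristic: divide λ by 10 after a step that lowers the cost, multiply it by 10 otherwise. These choices, and the names `normal_equations` and `levenberg_marquardt`, are assumptions made for the example rather than the specific rule of any particular implementation.

```python
import math

def normal_equations(ts, zs, a, b):
    """Return the SSE cost, Gauss-Newton Hessian J^T J and gradient J^T dz."""
    cost = h00 = h01 = h11 = g0 = g1 = 0.0
    for t, z in zip(ts, zs):
        e = math.exp(b * t)
        dz = a * e - z
        j0, j1 = e, a * t * e
        cost += dz * dz
        h00 += j0 * j0
        h01 += j0 * j1
        h11 += j1 * j1
        g0 += j0 * dz
        g1 += j1 * dz
    return cost, (h00, h01, h11), (g0, g1)

def levenberg_marquardt(ts, zs, a, b, lam=1e-3, iterations=50):
    cost, (h00, h01, h11), (g0, g1) = normal_equations(ts, zs, a, b)
    for _ in range(iterations):
        # Damped system (H + lam * D) dx = -g, with D = identity here.
        d00, d11 = h00 + lam, h11 + lam
        det = d00 * d11 - h01 * h01
        da = -(d11 * g0 - h01 * g1) / det
        db = -(d00 * g1 - h01 * g0) / det
        new_cost, new_h, new_g = normal_equations(ts, zs, a + da, b + db)
        if new_cost < cost:
            # Accept the step and trust the local model more.
            a, b, cost = a + da, b + db, new_cost
            (h00, h01, h11), (g0, g1) = new_h, new_g
            lam /= 10.0
        else:
            # Reject the step and damp more strongly.
            lam *= 10.0
    return a, b, cost

ts = [0.0, 0.5, 1.0, 1.5, 2.0]
zs = [2.0 * math.exp(0.5 * t) for t in ts]  # noise-free synthetic observations
a_lm, b_lm, final_cost = levenberg_marquardt(ts, zs, 1.0, 0.0)
```

From the deliberately poor start (a, b) = (1, 0), the first nearly undamped trial steps overshoot and are rejected. Once λ has grown enough, a shorter step is accepted and the iteration then proceeds much like Gauss-Newton.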

Local model : In optimization, a local approximation to the function being optimized, which is easy 
enough to optimize that an iterative optimizer for the original function can be based on it. The 
second order Taylor series model gives Newton’s method. 

Local parametrization : A parametrization of a nonlinear space based on offsets from some current 
point. Used during an optimization step to give better local numerical conditioning than a more 
global parametrization would. 

LU decomposition: The usual matrix factorization form of Gaussian elimination. 

Minimum degree ordering: One of the most widely used automatic variable ordering methods for 
sparse matrix factorization. 

Minimum detectable gross error : The smallest outlier that can be detected on average by an outlier 
detection method. 

Nested dissection: A top-down divide-and-conquer variable ordering method for sparse matrix 
factorization. Recursively splits the problem into disconnected halves, dealing with the separating 
set of connecting variables last. Particularly suitable for surface coverage problems. Also called
recursive partitioning.

Nested models: Pairs of models, of which one is a specialization of the other obtained by freezing 
certain parameter(s) at prespecified values.

Network : The interconnection structure of the 3D features, the cameras, and the measurements that 
are made of them (image points, etc.). Usually encoded as a graph structure. 

Newton method: The basic iterative second order optimization method. The Newton step state
update $\delta x = -H^{-1} g$ minimizes a local quadratic Taylor approximation to the cost function at
each iteration.
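
The Newton step can be sketched for a small two-variable problem. The cost function f(x, y) = cosh(x) + (y − x)², the function names, and the starting point are all chosen here purely for illustration. The function is convex with its minimum at the origin, so the undamped iteration is safe in this case.

```python
import math

def newton_minimize(grad, hess, x, y, iterations=10):
    """Minimize a 2-variable cost using full Newton steps dx = -H^{-1} g."""
    for _ in range(iterations):
        g0, g1 = grad(x, y)
        h00, h01, h11 = hess(x, y)
        # Closed-form solution of the 2x2 system H dx = -g.
        det = h00 * h11 - h01 * h01
        dx = -(h11 * g0 - h01 * g1) / det
        dy = -(h00 * g1 - h01 * g0) / det
        x, y = x + dx, y + dy
    return x, y

def grad(x, y):
    # Gradient of f(x, y) = cosh(x) + (y - x)^2.
    return math.sinh(x) - 2.0 * (y - x), 2.0 * (y - x)

def hess(x, y):
    # Hessian entries (H_xx, H_xy, H_yy) of the same function.
    return math.cosh(x) + 2.0, -2.0, 2.0

x_min, y_min = newton_minimize(grad, hess, 1.0, -1.0)
```

For a general non-convex cost the same step can move uphill or diverge, which is why step control such as Levenberg-Marquardt damping is usually added.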

Normal equations: See Gauss-Newton method.

Nuisance parameter: Any parameter that had to be estimated as part of a nonlinear parameter 
estimation problem, but whose value was not really wanted. 

Outer: External. See inner. 

Outer orientation: Camera pose (position and angular orientation). 

Outer reliability: The influence of unremoved outliers on the final parameter estimates, i.e. the 
extent to which they are reliable even though some (presumably small or lowly-weighted) outliers
may remain undetected. 

Outlier: An observation that deviates significantly from its predicted position. More generally, 
any observation that does not fit some preconceived notion of how the observations should be 
distributed, and which must therefore be removed to avoid disturbing the parameter estimates.
See total distribution.

Pivoting : Row and/or column exchanges designed to promote stability during matrix factorization. 

Point estimator: Any estimator that returns a single “best” parameter estimate, e.g. maximum 
likelihood, maximum a posteriori. 

Pose: 3D position and orientation (angle), e.g. of a camera. 

Preconditioner: A linear change of variables designed to improve the accuracy or convergence rate
of a numerical method, e.g. a first order optimization iteration. Variable scaling is the diagonal 
part of preconditioning. 

Primary structure : The main decomposition of the bundle adjustment variables into structure and 
camera ones. 

Profile matrix: A storage scheme for sparse matrices in which all elements between the first and 
the last nonzero one in each row are stored, even if they are zero. Its simplicity makes it efficient 
even if there are quite a few zeros. 

Quality control : The monitoring of an estimation process to ensure that accuracy requirements 
were met, that outliers were removed or down-weighted, and that appropriate models were used, 
e.g. for additional parameters. 

Radial distribution: An observation error distribution which retains the Gaussian dependence on
a squared residual error $r = x^\top W x$, but which replaces the exponential form with a more
robust long-tailed one.

Recursive: Used of filtering-based reconstruction methods that handle sequences of images or
measurements by successive updating steps. 

Recursive partitioning: See nested dissection. 

Reduced problem: Any problem where some of the variables have already been eliminated by 
partial factorization, leaving only the others. The reduced camera system (20) is the result of 
reducing the bundle problem to only the camera variables. (§6.1, 8.2, 4.4). 

Redundancy: The extent to which any one observation has only a small influence on the results, 
so that it could be incorrect or missing without causing problems. Redundant consensus is the
basis of reliability. Redundancy numbers r are a heuristic measure of the amount of redundancy 
in an estimate. 

Relative entropy: An information-theoretic measure of how badly a model probability density $p_1$
fits an actual one $p_0$: the mean (w.r.t. $p_0$) log likelihood contrast of $p_0$ to $p_1$, $\langle \log(p_0/p_1) \rangle_{p_0}$.
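
For discrete distributions the definition reduces to a finite sum, which the following sketch evaluates. The function name `relative_entropy` and the example distributions are illustrative assumptions. Terms with zero actual probability contribute nothing, and a model that assigns zero probability to a possible outcome gives an infinite value.

```python
import math

def relative_entropy(p0, p1):
    """Mean under p0 of log(p0 / p1) for discrete distributions on one support."""
    total = 0.0
    for actual, model in zip(p0, p1):
        if actual > 0.0:
            if model == 0.0:
                return math.inf  # the model rules out an outcome that occurs
            total += actual * math.log(actual / model)
    return total

actual = [0.5, 0.5]
model = [0.25, 0.75]
kl = relative_entropy(actual, model)  # equals 0.5 * log(4 / 3)
```

The measure is not symmetric: swapping the roles of the actual and model distributions generally gives a different value.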

Resection: (of optical rays). Solving for 3D camera poses and possibly calibrations, given image 
features and the corresponding 3D feature positions. See Intersection. 

Resection-intersection: See alternation. 

Residual: The error $\Delta z$ in a predicted observation, or its cost function value.

S-transformation: A transformation between two gauges, implemented locally by a gauge pro-
jection matrix $P_G$.

Scaling: See preconditioner. 

Schur complement: Of $A$ in $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$ is $D - C A^{-1} B$. See §6.1.
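
A small sketch shows how the Schur complement is used to solve a block system by elimination, in the spirit of the reduced problem entry above. The setup is an assumption made for illustration: A is diagonal (a stand-in for an easily inverted structure block), C = Bᵀ, and D is 2 × 2. The function name `schur_solve` and the numbers are likewise hypothetical.

```python
def schur_solve(a_diag, b, d, rhs_x, rhs_y):
    """Solve [[A, B], [B^T, D]] [x; y] = [rhs_x; rhs_y], A diagonal, D 2x2."""
    n = len(a_diag)
    # Schur complement S = D - B^T A^{-1} B and reduced right-hand side.
    s = [row[:] for row in d]
    r = list(rhs_y)
    for i in range(n):
        for p in range(2):
            r[p] -= b[i][p] * rhs_x[i] / a_diag[i]
            for q in range(2):
                s[p][q] -= b[i][p] * b[i][q] / a_diag[i]
    # Solve the small reduced system S y = r by Cramer's rule.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    y = [(s[1][1] * r[0] - s[0][1] * r[1]) / det,
         (s[0][0] * r[1] - s[1][0] * r[0]) / det]
    # Back-substitute for x from A x = rhs_x - B y.
    x = [(rhs_x[i] - b[i][0] * y[0] - b[i][1] * y[1]) / a_diag[i]
         for i in range(n)]
    return x, y, s

a_diag = [2.0, 4.0, 5.0]
b = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
d = [[6.0, 1.0], [1.0, 5.0]]
# Right-hand sides built so that the solution is x = [1, 2, 3], y = [1, -1].
x, y, s = schur_solve(a_diag, b, d, [3.0, 9.0, 14.0], [10.0, 1.0])
```

Only the 2 × 2 reduced system needs a genuine linear solve. The diagonal block is handled by divisions, which is the saving that elimination is meant to provide.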

Second order method / convergence : See asymptotic convergence. 

Secondary structure : Internal structure or sparsity of the off-diagonal feature-camera coupling 
block of the bundle Hessian. See primary structure. 

Self calibration: Recovery of camera (internal) calibration during bundle adjustment. 

Sensitivity number: A heuristic number s measuring the sensitivity of an estimate to a given 
observation. 

Separable problem: Any optimization problem in which the variables can be separated into two 
or more subsets, for which optimization over each subset given all of the others is significantly 
easier than simultaneous optimization over all variables. Bundle adjustment is separable into 3D 
structure and cameras. Alternation (successive optimization over each subset) is a naive approach 
to separable problems. 

Separating set: See nested dissection. 

Sequential Quadratic Programming (SQP): An iteration for constrained optimization problems,
the constrained analogue of Newton’s method. At each step optimizes a local model based on a 
quadratic model function with linearized constraints. 

Sparse: “Any matrix with enough zeros that it pays to take advantage of them” (Wilkinson). 

State: The bundle adjustment parameter vector, including all scene and camera parameters to be 
estimated. 

Sticky prior : A robust prior with a central peak but wide tails, designed to let the estimate ‘unstick’ 
from the peak if there is strong evidence against it. 

Subset selection: The selection of a stable subset of ‘live’ variables on-line during pivoted factor- 
ization. E.g., used as a method for selecting variables to constrain with trivial gauge constraints 
(§9.5). 

Successive Over-Relaxation (SOR) : See alternation. 

Sum of Squared Errors (SSE) : The nonlinear least squares cost function. The (possibly weighted) 
sum of squares of all of the residual feature projection errors. 

Total distribution: The error distribution expected for all observations of a given type, including 
both inliers and outliers. I.e. the distribution that should be used in maximum likelihood estimation. 

Trivial gauge: A gauge that fixes a small set of predefined reference features or cameras at given 
coordinates, irrespective of the values of the other features. 

Trust region : See Levenberg-Marquardt. 

Updating: Incorporation of additional observations without recalculating everything from scratch. 

Variable ordering strategy : See fill-in. 

Weight matrix: An information (inverse covariance) like matrix W, designed to put the
correct relative statistical weights on a set of measurements. 

Woodbury formula: The matrix inverse updating formula (18). 

Discussion for Session on Bundle Adjustment



This section contains the discussion that followed the special panel session on 
bundle adjustment. 

Discussion 

Kenichi Kanatani: This is a question related to Richard’s talk. You have obser- 
vations and a camera model, you derive some error function and you minimize 
over all possible parameter values. But shouldn’t you also be optimizing over 
different camera models? 

Richard Hartley: That sounds a bit like a loaded question. You’re getting into 
the area of model selection for the cameras, which I know you’ve done some work 
on yourself. My point of view here is that usually you know what sort of camera 
you have, a perspective or push-broom camera or whatever. You may want to 
choose between an affine approximation and a full projective camera, or how 
many radial distortion parameters to include, but that’s beyond the scope of 
what I would normally call bundle adjustment. You often need to initialize and 
bundle adjust several times with different models, so that you can compare them 
and choose the best. I do that in my program, for example to decide whether 
points are coplanar or not. You can use scientific methods like AIC for it if you 
want. 

Bill Triggs: The place in photogrammetry where this really comes up is “addi- 
tional parameters” modelling things like lens distortion, non-flatness of film, etc, 
where there is no single best model. Such parameters improve the fit significantly, 
but if you add too many, yes, you overfit, and the results get worse. The usual 
decision criteria in photogrammetry are based on predicted covariance matrices. 
If you add a parameter and the estimated covariance of something that you
want to measure (3D points or whatever) jumps, or if there is more than say 
95% correlation between some pairs of parameters, you declare overfitting and 
back off. 

The other point is that in many cases, a little regularization — adding a 
small prior covariance — is enough to stabilize parameter combinations that are 
sometimes poorly controlled, without biasing the results too much in cases when 
they are better controlled by the measurements. So within limits, you can often 
just take a general model with lots of parameters and fit that, letting the prior 
smooth things out if necessary. 

P. Anandan: My question is also about model selection. With bundle-adjust- 
ment, if you have impoverished data — planar scene, narrow field of view, noisy 
points and maybe some outliers, things like that — the question is how stable is 
bundle-adjustment under these conditions. Assuming that you’ve got a decent 
initial estimate, is it likely that you’ll actually get the correct solution, or will 
this kind of error cause you to fail. Is there any wisdom on this? 

Richard Hartley: My practical experience is that when these situations occur,
data that’s too planar say, it’s best if you can find some sort of additional 
constraint to include. Maybe you don’t know the focal length, but you think 
it’s surely 1000 ± 500; or the points may not be coplanar, but they’re probably 
coplanar with some large deviation of ten-thousand feet say. That stabilizes your 
solution by putting soft constraints on it. 

Andrew Fitzgibbon: I agree, and this is also what Bill was talking about. If 
you’re viewing a planar scene you’re going to see large correlations in your focal 
lengths and motion parameters. If you damp some of the camera parameters 
when you discover these correlations, you’ll get a more reliable motion estimate 
from your planar scene. 

Richard Hartley: You can estimate the camera from a planar scene provided 
you start with constraints on the cameras, like known principal point. 

Bill Triggs: I think our responses so far have missed an important point. What 
is bundle adjustment? - It’s just minimization of your best guess at the true 
statistical error for the problem, over all the parameters that you think are im- 
portant for the problem. If bundle adjustment is unstable, that’s really another 
way of saying that you just don’t have enough information to stably estimate 
things in the situation you think you’re in. If some other method, say a lin-
earized one, appears to give you stabler results, it’s lying. Either it’s biased,
or it’s estimating a different error model or parameters. 

P. Anandan: Maybe I’m misunderstanding. Isn’t there also the approach of 
minimizing taking a local descent type of approach, so it’s not just the opti- 
mization function that could be wrong? 

Bill Triggs: Bundle adjustment is a local descent approach. Of course, you 
might have convergence to the wrong local minimum. 

P. Anandan: Yes, that’s what I’m talking about. . . 

Bill Triggs: If bundle adjustment gives you a wrong local minimum, that min- 
imum is a feature of the true statistical error surface. So it’s likely that other 
methods which attempt to approximate this surface will have similar behaviour 
from a similar initialization. 

Harry Shum: First a comment: who cares about bundle adjustment if you 
have a wrong model? Secondly, I want to ask Richard about camera models. 
You mentioned quite a few, and there were at least two that I wasn’t aware of 
— the 2D camera and the polynomial cameras. 

Richard Hartley: The 2D camera is just my name for the cases where you 
have a homography between the world and the images (no translation or planar 
scene), or where the camera is a mapping from a plane in 3D onto a line. That 
includes the linear push-broom sensors used in sensing satellites, and some X-ray 
sensors that image a point source of X-rays on a linear sensor. 

The rational polynomial camera is a general model used in the US intelligence 
community. It approximately fits a large number of different sensors: ordinary 
cameras, SAR, push-broom and push-sweep cameras, and lots of others that I’m 
unaware of. Basically, whereas an ordinary perspective camera maps a point in 
space to the quotient of two linear polynomials in the 3D point coordinates, the
rational polynomial camera uses a quotient of two higher degree polynomials. I
have a paper on it in a DARPA IU workshop proceedings if you want the details.

1 Introduction

The workshop ended with a 75 minute panel session, with Richard Hartley, P. 
Anandan, Jitendra Malik, Joe Mundy and Olivier Faugeras as panelists. Each 
panelist selected a topic related to the workshop theme that he felt was important 
and gave a short position statement on it, followed by questions and discussion. 



2 Richard Hartley 



Richard Hartley discussed error modelling and self-calibration, arguing that we 
should be willing to accept practical compromises rather than leaning too heavily 
towards theoretical ideals. 

As far as error modelling is concerned, he argued that the precise details 
of the assumed error model do not usually seem to be very important in prac- 
tice. So given that we seldom know what the true underlying error model is, 
it does not seem worthwhile to go to great lengths to get the last few percent 
of precision. On the other hand, when combining grossly different error models 
it is important to compensate at least approximately, for example by applying 
normalizing transformations to the data. Rather than relying too much on the- 
oretical optima, you should choose a realistic performance metric (for example 
epipolar line distances in fundamental matrix estimation) and monitor how well 
you have done. 

On calibration vs. self-calibration, he argued that it was usually unwise to 
pretend that you know nothing at all about the cameras, as self-calibration from 
minimal data (such as only vanishing skew) is often very unstable. It is often 
more accurate in practice, e.g. to assume a known principal point at the centre of 
the image, rather than hoping that a completely unknown one will be estimated 
accurately enough. This sort of information can be included as a prior under 
maximum a posteriori (MAP) estimation. 

3 P. Anandan

P. Anandan focused on the limited domain of applicability of current 3D recon- 
struction algorithms, and of our reconstruction paradigm in general. He observed 
that for limited classes of scenes we can already do many things: 

1. Our geometric reconstruction algorithms are nearly optimal. Given a suitable 
set of initial correspondences (even with some outliers), we can recover multi- 
camera geometry and do ‘point-cloud’ 3D reconstruction nearly as well as 
the data will support. And for at least some classes of scene, suitable initial 
correspondences can also be recovered. 

2. Our techniques for 3D surface representation are also very advanced. We 
can do multi-baseline stereo to get good quality dense 2.5-D maps, put the
results into surface fitting techniques to recover 3D shapes, etc.

However, the domain of applicability of all this is limited. Today’s algorithms 
work fine for sparse 3D scenes with only a few objects, simple photometry and 
few or no inter-object occlusions. But they cannot handle: 

— Scenes with significant 3D clutter — your desk, shutters, tree branches 
through which the background is seen, etc. 

— Scenes with complex photometry, reflections from one surface onto another, 
translucency. Almost any scene with glass windows or shiny objects that 
reflect others cannot currently be processed. 

— Scenes with almost any nontrivial dynamics — moving people, etc. 

The point is that if you casually shoot a video in an ordinary everyday envi- 
ronment, it will almost certainly not be processable by current reconstruction 
algorithms. 

To make progress, grouping and segmentation need to be more tightly inte- 
grated with reconstruction. One good way to do this is layered representations 
built using mixture models and Bayesian estimation. Layers have relatively sim- 
ple geometry, are capable of handling occlusions and transparency, and provide 
a natural model for both local continuity and global occlusion coherence. 

4 Jitendra Malik 

Jitendra Malik argued that “the vision community should return to vision from 
photogrammetry” : 

1. We should declare victory on the reconstruction of 3D geometry. The mathe-
matics of SFM, shape-from-texture, reconstruction from monocular views of
constrained classes such as symmetric objects and SOR’s, etc, is now worked 
out, or is at the 95% point. Even though we do not yet have reliable fully 
automatic reconstruction software, the hurdles lie elsewhere. 

2. The correspondence problem, particularly dense correspondence as needed 
for surfaces, is far from being solved. This problem cannot be solved robustly 
and accurately by treating it as a pure, modularized matching problem in
isolation from other visual processing. Variations on themes of maximizing 
image cross-correlation can only go so far. Regions without texture and dis- 
continuities will be the bane of all such algorithms. Coupling correspondence 
with structure recovery helps but that too is only part of the solution. 

3. The era where SFM (and other geometrically posed problems) could be 
profitably studied in isolation is over. Problems like correspondence are in- 
timately linked to grouping and perceptual organization, perhaps even to 
recognition. Many apparent difficulties in correspondence simply vanish in 
such an integrated view. E.g. we can match untextured regions on a group 
to group basis, aligning sharply at discontinuities using the monocular in- 
formation. 

Several of these issues have the tinge of the bad old days of AI-inspired vision,
with lots of talk but no real, solid results. But there is an ongoing revival, based 
on powerful new techniques from probabilistic modeling, graph theory, learning 
theory etc. Some (biased) samples: his own group’s work on image segmentation; 
Andrew Blake’s group’s work on tracking; Forsyth’s and Zhu’s use of MCMC 
techniques for recognition. He ended by encouraging more people to “join the 
good fight” . . . 



Discussion 

The reaction from the audience in the discussion that followed Jitendra Malik’s 
presentation was generally supportive: 

Jean Ponce: I completely agree with Jitendra, but I think that the problem is 
not with geometry alone. People have also looked at photometry, illumination 
etc, and all of these fail because of poor segmentation. Looking at inference 
systems like probabilities is a good idea, but it’s still going to be very scary. So 
long as we don’t know how to solve segmentation, vision is going to remain a 
very dirty problem. I find this a little discouraging, but I think that it’s the good 
fight, and that we should all be working on it. 

Rick Szeliski: I want to thank Jitendra for a very stimulating position. I think 
that a positive example of the potential of learning techniques is the tremendous 
advances made in face recognition and matching. For example Thomas Vetter’s 
and Chris Taylor’s groups have both been able to build 3D models from single 
images. If we’re willing to consider constrained sub-problems, and if we have 
enough training data to learn from, then working systems can be built that do 
not necessarily have any sort of geometric interpretation, but that are able to 
learn to do the right thing by example. 



5 Joe Mundy 

Joe Mundy discussed systems issues. It now takes man years of effort to develop 
a competent 3-D vision system. These difficulties cut across various subject areas 
and levels of representation, and they present a formidable barrier to the next
generation of applications. 

Code is currently developed essentially one thesis at a time. There is little 
representational unity. Cross-fertilization between (or even within) labs is lim- 
ited, so a great deal of time is wasted reinventing things that have already been 
done. For example it cost GE 1-2 man years of effort to start working on video. 
Also, graduate students are not usually experienced software designers, and rel- 
atively little of their code is stable and well-organized enough to be reused. 

He then gave a brief history of the TargetJr C++ vision environment, and
suggested that it was a reasonable success: it has been used in a number of 
projects, it supplies a unified environment for student training, and code sharing 
across labs has actually occurred. However the current version is not perfect: it 
is a big system (100k lines of code) requiring a heavy infrastructure, and C++
is not easy (tests showed that only 1 in 10 C++ programmers really know the
language). It is likely that TargetJr will evolve to a more open, modular, client- 
server architecture with a lightweight Java GUI, a Corba or DCOM backplane, 
and specialist modules for the various vision and graphics tasks. 

He finished by advocating the Linux/Open Source software model as the
most suitable way to sustain such large, collaborative software efforts. Any re- 
strictions on distribution or commercial use prevent potential collaborators from 
contributing. (An example is the ‘pay for commercial use’ licence of the Esprit 
CGAL computational geometry library, http://www.cs.uu.nl/CGAL, which ex- 
cluded it from consideration for TargetJr). 

In the discussion, Richard Szeliski suggested that standardization was diffi- 
cult when there was disagreement over representations, which algorithms to use, 
and details like border effects. Joe Mundy replied that in some areas such as 
spatial indexing and Hough voting, there was enough agreement to standardize 
many representations and algorithms, and in more difficult ones such as edge 
detection at least the input and output representation could be standardized, as 
in TargetJr which supports a number of different edge detectors. 



6 Olivier Faugeras 

Olivier Faugeras made a number of points in support of the geometric approach 
to vision: 

1. If we believe Popper’s definition of scientific knowledge based on the falsifia- 
bility of theories, we have to accept that like physics, chemistry, biology and 
computer science, vision geometry — structure from motion and so forth — 
is a science. The objects of study are well defined. Quantitative theories exist 
and can be tested and criticized by independent observers. But the softer 
areas of computer vision (correspondence, perceptual organization . . . ) are 
not currently sciences. What phenomena are they trying to study and make 
quantitative theories of? What experimental procedures can be used to in- 
validate these theories? 

2. Far from being irrelevant to ‘real’ vision, the geometric theories developed
over the last 10 years to model structure from motion, self-calibration, strat- 
ified vision, etc, are highly pertinent. They both deepen our understanding 
of what is going on when we point cameras at the world, and help to make 
practical applications easier to solve. Also, representations of knowledge are 
and should be application dependent. Geometry often turns out to be the 
most appropriate representation. 

3. On the future of vision geometry, he suggested that geometric theorem prov- 
ing may prove to be an effective tool for many problems, then argued that 
the old AI vision problems need to be seriously reexamined from the stand- 
point of both biological vision and mathematics, e.g. by asking what it really 
means to segment a set of images. 

Discussion 

Michal Irani: I must say I’m a little puzzled by your definition of science. 
According to you, something becomes science only after it has been solved and 
you have the theory for it. So when Galileo had the hypothesis that the earth 
is round, he was dealing with religion not science, and he was probably being 
naive. That’s my interpretation of your definition. 

Olivier Faugeras: I guess my statement was unclear. What I’m saying is that 
theories live and die. Science is the way you build and refute them, so the problem 
has not been solved just because you have a theory. The Copernican theory of 
gravitational attraction was shown to be wrong by Newton, for example. The 
important thing is that theories must be criticized, and this doesn’t happen 
enough in our field. People come up with theories for structure from motion, but 
they’re not being criticized enough. 

P. Anandan: I think that it’s unnecessary to worry about whether vision is a 
science or not. It’s a field of study that has scientific components, but also other 
components. Also, I wanted to say that Olivier made an important point in 
identifying vision and AI as having the same sort of difficulties. Ultimately, AI and 
vision are about trying to develop a theory of reasoning. I think that reasoning 
is the only phenomenon that we could actually identify as being studied at a 
theoretical level. All the other things with geometry and optics are important, 
but the problem that I think eludes us, and will continue to elude us is a theory 
of reasoning. 

Jitendra Malik: I think that we should be very clear about what vision is. 
Vision is not a science, and structure from motion is not a science. Vision is 
a sort of a hybrid field with several components: mathematics, science, and 
engineering. 

First, the mathematical component. This has a precise formulation such as 
“given k points in n views can I recover A”. Most of the basic results in structure 
from motion are mathematical theorems, and for a theorem you have a simple 
test: is it valid or not? Mathematics is a good thing, but it’s not a science. 

Secondly, there’s science, which is fundamentally about empirical phenomena. 
Biological vision is a science, but I’m not sure whether computer vision is. If you 
have phenomena, you can construct models for them, compare with empirical 
results, and see how they fit. I think that the view of science being primarily 
driven by Popper’s falsifiability is a bit naive. We have to think of it more in terms 
of Kuhn’s paradigm shifts. It is an evolutionary process. It’s true that in the end, 
scientific papers get written in a falsifiable sense, with some experiment, control, 
and so forth. But the way things actually happen is that there is some collection 
of facts, vaguely understood, then some model of them, then somebody else 
comes along with some inconvenient facts. Models stick around, or get replaced 
finally by new ones. That’s the way it works, it doesn’t get falsified with a single 
fact. 

Finally, there is the engineering aspect of computer vision. For this we have 
very clear criteria. Ultimately, in engineering there are specific tasks with specific 
performance criteria. You want to recover depth? - Well, measure the depth, say 
what your computer vision algorithm does, compute a percentage error. It’s very 
simple. 

But since we are in a field that has all of these things mixed together, what 
we have to do is just keep trying to do good stuff. It’s more important to do 
good stuff than to worry about whether it is maths, science, or engineering. 
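
As an illustration of the engineering criterion Jitendra Malik describes above (measure the depth, run the algorithm, compute a percentage error), here is a minimal Python sketch. It is not taken from the discussion: the function name, the choice of mean absolute relative error as the "percentage error", and the sample values are all assumptions made for illustration.

```python
def mean_percentage_depth_error(estimated, measured):
    """Mean absolute depth error, as a percentage of the measured depth.

    `estimated` holds depths produced by some vision algorithm and
    `measured` holds independently measured reference depths, in the
    same units and the same order. Both names are hypothetical.
    """
    if len(estimated) != len(measured) or len(measured) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    total = 0.0
    for est, ref in zip(estimated, measured):
        if ref <= 0:
            raise ValueError("measured depths must be positive")
        # Relative error of this point with respect to the measurement.
        total += abs(est - ref) / ref

    return 100.0 * total / len(measured)


# Made-up sample values (metres), for illustration only.
measured_depths = [2.0, 4.0, 5.0]
estimated_depths = [2.1, 3.8, 5.0]
error = mean_percentage_depth_error(estimated_depths, measured_depths)
print(f"mean depth error: {error:.2f}%")
```

Other summary statistics (median or worst-case error, for instance) would serve equally well; the point is only that the task and its performance measure are fixed in advance.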



7 General Discussion 

The discussion and questions after the presentations were rather diverse and can 
only be summarized briefly here. The point that came out most clearly was that 
people had never forgotten about segmentation, perceptual organization and so 
forth. But for a time there was a feeling that new ideas were needed before we 
could start making progress on these problems, whereas vision geometry was 
developing very rapidly. However, the tide appears to be turning: there seemed 
to be a remarkably strong consensus (at least among the vocal minority!) that 
we will make significant advances on these difficult mid- and high-level vision 
problems in the near future, notably: 

1. By exploiting domain constraints, using purpose built parametric models 
and optimization/learning over a suitable training set (e.g. for face 
representation, medical applications, as in work by Thomas Vetter, Chris Taylor, 
Andrew Blake). 

2. By applying ‘new wave Bayesian’ methods — probabilistic networks, HMM’s 
and MRF’s for knowledge representation, sampling (MCMC, Condensation) 
for calculations, parameter and structure learning algorithms. 

Nobody (vocally) disagreed either that significant advances would be made, or 
that these rather mathematical tools were the appropriate ones. Even the sugges- 
tion that Bayesian networks might prove to be an effective model of some aspects 
of human cognition went unchallenged. This consensus was perhaps surprising 
given the geometric orientation of the workshop. Certainly, we would not have 
expected to find it a few years ago. But large-scale statistics and optimization 
seem now to have become mainstream tools. 

Comfortingly, there was also a consensus that whether computer vision counts 
as mathematics, science, or engineering, it is a domain worth studying. 